Abstract



High-dimensional statistical inference deals with models in which the number of parameters
$p$ is comparable to or larger than the sample size $n$. Since it is usually impossible to obtain
consistent procedures unless $p/n \to 0$, a line of recent work has studied models with various types
of low-dimensional structure, including sparse vectors, sparse and structured matrices, low-rank
matrices, and combinations thereof. In such settings, a general approach to estimation is to
solve a regularized optimization problem, which combines a loss function measuring how well the
model fits the data with some regularization function that encourages the assumed structure.
This paper provides a unified framework for establishing consistency and convergence rates for
such regularized M-estimators under high-dimensional scaling. We state one main theorem and
show how it can be used to re-derive some existing results, and also to obtain a number of new
results on consistency and convergence rates, in both $\ell_2$-error and related norms. Our analysis
also identifies two key properties of loss and regularization functions, referred to as restricted
strong convexity and decomposability, that ensure corresponding regularized M-estimators have
fast convergence rates, and which are optimal in many well-studied cases.



1 Introduction 



High-dimensional statistics is concerned with models in which the ambient dimension of the problem
$p$ may be of the same order as, or substantially larger than, the sample size $n$. On one hand, its
roots are quite old, dating back to work on random matrix theory and high-dimensional testing
problems (e.g., [24, 44]). On the other hand, the past decade has witnessed a tremendous
surge of research activity. Rapid development of data collection technology is a major driving force:
it allows for more observations to be collected (larger $n$), and also for more variables to be measured
(larger $p$). Examples are ubiquitous throughout science: astronomical projects such as the Large
Synoptic Survey Telescope [1] produce terabytes of data in a single evening; each sample is a
high-resolution image, with several hundred megapixels, so that $p \gg 10^8$. Financial data is also of a
high-dimensional nature, with hundreds or thousands of financial instruments being measured and
tracked over time, often at very fine time intervals for use in high frequency trading. Advances in
biotechnology now allow for measurements of thousands of genes or proteins, and lead to numerous
statistical challenges (e.g., see the paper [?] and references therein). Various types of imaging
technology, among them magnetic resonance imaging in medicine [42] and hyper-spectral imaging
in ecology [37], also lead to high-dimensional data sets.

In the regime $p \gg n$, it is well known that consistent estimators cannot be obtained unless
additional constraints are imposed on the model. Accordingly, there are now several lines of work
within high-dimensional statistics, all of which are based on imposing some type of low-dimensional
constraint on the model space, and then studying the behavior of different estimators. Examples
include linear regression with sparsity constraints, estimation of structured covariance or inverse
covariance matrices, graphical model selection, sparse principal component analysis, low-rank
matrix estimation, matrix decomposition problems, and estimation of sparse additive non-parametric
models. The classical technique of regularization has proven fruitful in all of these contexts. Many
well-known estimators are based on solving a convex optimization problem formed by the sum
of a loss function with a weighted regularizer; we refer to any such method as a regularized
M-estimator. For instance, in application to linear models, the Lasso or basis pursuit approach [68, 20]
is based on a combination of the least-squares loss with $\ell_1$-regularization, and so involves solving a
quadratic program. Similar approaches have been applied to generalized linear models, resulting in
more general (non-quadratic) convex programs with $\ell_1$-constraints. Several types of regularization
have been used for estimating matrices, including standard $\ell_1$-regularization, a wide range of sparse
group-structured regularizers, as well as regularization based on the nuclear norm (sum of singular
values).



Past work: Within the framework of high-dimensional statistics, the goal is to obtain bounds on
a given performance metric that hold with high probability for a finite sample size, and provide
explicit control on the ambient dimension $p$, as well as other structural parameters such as the
sparsity of a vector, degree of a graph, or rank of a matrix. Typically, such bounds show that the
ambient dimension and structural parameters can grow as some function of the sample size $n$, while
still having the statistical error decrease to zero. The choice of performance metric is
application-dependent; some examples include prediction error, parameter estimation error, and model
selection error.

By now, there are a large number of theoretical results in place for various types of regularized
M-estimators.¹ Sparse linear regression has perhaps been the most active area, and multiple bodies
of work can be differentiated by the error metric under consideration. They include work on exact
recovery for noiseless observations (e.g., [22, 21, 15]), prediction error consistency (e.g., [25, ?, 72,
80]), consistency of the parameter estimates in $\ell_2$ or some other norm (e.g., [13, 12, 11, 80, 48, 16]),
as well as variable selection consistency (e.g., [?]). The information-theoretic limits of sparse
linear regression are also well understood, and $\ell_1$-based methods are known to be optimal for
$\ell_q$-ball sparsity [57], and near-optimal for model selection. For generalized linear models (GLMs),
estimators based on $\ell_1$-regularized maximum likelihood have also been studied, including results on
risk consistency, consistency in $\ell_2$- or $\ell_1$-norm [?, 31, 46], and model selection consistency [60, 10].
Sparsity has also proven useful in application to different types of matrix estimation problems,
among them banded and sparse covariance matrices (e.g., [14, 32]). Another line of work has
studied the problem of estimating Gaussian Markov random fields, or equivalently inverse covariance
matrices with sparsity constraints. Here there are a range of results, including convergence rates in
Frobenius, operator and other matrix norms [65, 61, 36, 83], as well as results on model selection
consistency [61, 36, 47]. Motivated by applications in which sparsity arises in a structured manner,
other researchers have proposed different types of block-structured regularizers (e.g., [70, 77, ?]),
among them the group Lasso based on $\ell_1/\ell_2$ regularization. High-dimensional
consistency results have been obtained for exact recovery based on noiseless observations [?],
convergence rates in $\ell_2$-norm (e.g., [?, 28, 41, ?]) as well as model selection consistency (e.g., [?,
53, ?]). Problems of low-rank matrix estimation also arise in numerous applications. Techniques
based on nuclear norm regularization have been studied for different statistical models, including
compressed sensing [63, 39], matrix completion [17, ?, 62, 51], multitask regression [78, 52, 64, ?, 3],
and system identification [?, 52, ?]. Finally, although the primary emphasis of this paper is on
high-dimensional parametric models, regularization methods have also proven effective for a class of
high-dimensional non-parametric models that have an additive decomposition (e.g., [?]),
and have been shown to achieve minimax-optimal rates [58].

¹ Given the extraordinary number of papers that have appeared in recent years, it must be emphasized that our
referencing is necessarily incomplete.



Our contributions: As we have noted previously, almost all of these estimators can be seen 
as particular types of regularized M-estimators, with the choice of loss function, regularizer and 
statistical assumptions changing according to the model. This methodological similarity suggests 
an intriguing possibility: is there a common set of theoretical principles that underlies analysis of 
all these estimators? If so, it could be possible to gain a unified understanding of a large collection 
of techniques for high-dimensional estimation, and afford some insight into the literature. 

The main contribution of this paper is to provide an affirmative answer to this question. In
particular, we isolate and highlight two key properties of a regularized M-estimator, namely a
decomposability property for the regularizer, and a notion of restricted strong convexity that depends
on the interaction between the regularizer and the loss function. For loss functions and
regularizers satisfying these two conditions, we prove a general result (Theorem 1) about consistency and
convergence rates for the associated estimators. This result provides a family of bounds indexed by
subspaces, and each bound consists of the sum of approximation error and estimation error. This
general result, when specialized to different statistical models, yields in a direct manner a large
number of corollaries, some of them known and others novel. In concurrent work, a subset of the
current authors have also used this framework to prove several results on low-rank matrix
estimation using the nuclear norm [52], as well as minimax-optimal rates for noisy matrix completion [51]
and noisy matrix decomposition [?]. Finally, en route to establishing these corollaries, we also prove
some new technical results that are of independent interest, including guarantees of restricted strong
convexity for group-structured regularization (Proposition 1).

The remainder of this paper is organized as follows. We begin in Section 2 by formulating the
class of regularized M-estimators that we consider, and then defining the notions of decomposability
and restricted strong convexity. Section 3 is devoted to the statement of our main result (Theorem 1),
and discussion of its consequences. Subsequent sections are devoted to corollaries of this main result
for different statistical models, including sparse linear regression (Section 4) and estimators based
on group-structured regularizers (Section 5).



2 Problem formulation and some key properties 

In this section, we begin with a precise formulation of the problem, and then develop some key 
properties of the regularizer and loss function. 

2.1 A family of M-estimators

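Let $Z_1^n := \{Z_1, \ldots, Z_n\}$ denote $n$ identically distributed observations with marginal
distribution $\mathbb{P}$, and suppose that we are interested in estimating some parameter $\theta$ of the
distribution $\mathbb{P}$. Let $\mathcal{L} : \mathbb{R}^p \times \mathcal{Z}^n \to \mathbb{R}$ be a convex and differentiable loss function that, for a
given set of observations $Z_1^n$, assigns a cost $\mathcal{L}(\theta; Z_1^n)$ to any parameter $\theta \in \mathbb{R}^p$. Let
$\theta^* \in \arg\min_{\theta \in \mathbb{R}^p} \overline{\mathcal{L}}(\theta)$ be any minimizer of the population risk
$\overline{\mathcal{L}}(\theta) := \mathbb{E}_{Z_1^n}[\mathcal{L}(\theta; Z_1^n)]$. In order to estimate this quantity based on the data $Z_1^n$,
we solve the convex optimization problem

$$\hat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \big\{ \mathcal{L}(\theta; Z_1^n) + \lambda_n \mathcal{R}(\theta) \big\}, \tag{1}$$

where $\lambda_n > 0$ is a user-defined regularization penalty, and $\mathcal{R} : \mathbb{R}^p \to \mathbb{R}_+$ is a norm. Note that this
set-up allows for the possibility of mis-specified models as well.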
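
To make the estimator (1) concrete, the following minimal sketch works out the simplest possible instance: a scalar parameter ($p = 1$), the least-squares loss, and the regularizer $\mathcal{R}(\theta) = |\theta|$. In that case the minimizer has a closed form given by soft thresholding. The function names (`soft_threshold`, `scalar_lasso`, `objective`) are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def soft_threshold(z: float, t: float) -> float:
    """Shrink z toward zero by t; this is the proximal map of t * |.|."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def objective(theta: float, x: list[float], y: list[float], lam: float) -> float:
    """Regularized objective (1/(2n)) * sum_i (y_i - x_i*theta)^2 + lam * |theta|."""
    n = len(x)
    loss = sum((yi - xi * theta) ** 2 for xi, yi in zip(x, y)) / (2 * n)
    return loss + lam * abs(theta)


def scalar_lasso(x: list[float], y: list[float], lam: float) -> float:
    """Closed-form minimizer of `objective` over theta (assumes x is not all zeros)."""
    n = len(x)
    xy = sum(xi * yi for xi, yi in zip(x, y)) / n  # (1/n) * x^T y
    xx = sum(xi * xi for xi in x) / n              # (1/n) * x^T x
    return soft_threshold(xy, lam) / xx


# Illustrative data (an assumption of this sketch): y = 2 * x exactly.
x_demo = [1.0, 2.0, 3.0]
y_demo = [2.0, 4.0, 6.0]
theta_hat = scalar_lasso(x_demo, y_demo, lam=1.0)
```

With a penalty of zero the sketch returns the ordinary least-squares solution, and a sufficiently large penalty drives the estimate exactly to zero, which is the behavior that motivates $\ell_1$-type regularizers for sparse models.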

Our goal is to provide general techniques for deriving bounds on the difference between any
solution $\hat{\theta}_{\lambda_n}$ to the convex program (1) and the unknown vector $\theta^*$. In this paper, we derive bounds
on the quantity $\|\hat{\theta}_{\lambda_n} - \theta^*\|$, where the error norm $\|\cdot\|$ is induced by some inner product $\langle \cdot, \cdot \rangle$ on
$\mathbb{R}^p$. Most often, this error norm will either be the Euclidean $\ell_2$-norm on vectors, or the analogous
Frobenius norm for matrices, but our theory also applies to certain types of weighted norms. In
addition, we provide bounds on the quantity $\mathcal{R}(\hat{\theta}_{\lambda_n} - \theta^*)$, which measures the error in the regularizer
norm. In the classical setting, the ambient dimension $p$ stays fixed while the number of observations
$n$ tends to infinity. Under these conditions, there are standard techniques for proving consistency
and asymptotic normality for the error $\hat{\theta}_{\lambda_n} - \theta^*$. In contrast, the analysis of this paper is all within
a high-dimensional framework, in which the tuple $(n, p)$, as well as other problem parameters, such
as vector sparsity or matrix rank, are all allowed to tend to infinity. Rather than asymptotic
statements, our goal is to obtain explicit finite sample error bounds that hold with high probability.

2.2 Decomposability of $\mathcal{R}$

The first ingredient in our analysis is a property of the regularizer known as decomposability, defined
in terms of a pair of subspaces $\mathcal{M} \subseteq \overline{\mathcal{M}}$ of $\mathbb{R}^p$. The role of the model subspace $\mathcal{M}$ is to capture
the constraints specified by the model; for instance, it might be the subspace of vectors with a
particular support (see Example 1), or a subspace of low-rank matrices (see Example 3). The
orthogonal complement of the space $\overline{\mathcal{M}}$, namely the set

$$\overline{\mathcal{M}}^\perp := \big\{ v \in \mathbb{R}^p \mid \langle u, v \rangle = 0 \text{ for all } u \in \overline{\mathcal{M}} \big\} \tag{2}$$

is referred to as the perturbation subspace, representing deviations away from the model subspace
$\mathcal{M}$. In the ideal case, we have $\overline{\mathcal{M}}^\perp = \mathcal{M}^\perp$, but our definition allows for the possibility that $\overline{\mathcal{M}}$
is strictly larger than $\mathcal{M}$, so that $\overline{\mathcal{M}}^\perp$ is strictly smaller than $\mathcal{M}^\perp$. This generality is needed for
treating the case of low-rank matrices and nuclear norm, as discussed in Example 3 to follow.

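Definition 1. Given a pair of subspaces $\mathcal{M} \subseteq \overline{\mathcal{M}}$, a norm-based regularizer $\mathcal{R}$ is decomposable
with respect to $(\mathcal{M}, \overline{\mathcal{M}}^\perp)$ if

$$\mathcal{R}(\theta + \gamma) = \mathcal{R}(\theta) + \mathcal{R}(\gamma) \quad \text{for all } \theta \in \mathcal{M} \text{ and } \gamma \in \overline{\mathcal{M}}^\perp. \tag{3}$$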

In order to build some intuition, let us consider the ideal case $\mathcal{M} = \overline{\mathcal{M}}$ for the time being, so that
the decomposition (3) holds for all pairs $(\theta, \gamma) \in \mathcal{M} \times \mathcal{M}^\perp$. For any given pair $(\theta, \gamma)$ of this form,
the vector $\theta + \gamma$ can be interpreted as a perturbation of the model vector $\theta$ away from the subspace
$\mathcal{M}$, and it is desirable that the regularizer penalize such deviations as much as possible. By the
triangle inequality for a norm, we always have $\mathcal{R}(\theta + \gamma) \le \mathcal{R}(\theta) + \mathcal{R}(\gamma)$, so that the decomposability
condition (3) holds if and only if the triangle inequality is tight for all pairs $(\theta, \gamma) \in (\mathcal{M}, \mathcal{M}^\perp)$. It is
exactly in this setting that the regularizer penalizes deviations away from the model subspace $\mathcal{M}$
as much as possible.

In general, it is not difficult to find subspace pairs that satisfy the decomposability property.
As a trivial example, any regularizer is decomposable with respect to $\mathcal{M} = \mathbb{R}^p$ and its orthogonal
complement $\mathcal{M}^\perp = \{0\}$. As will be clear in our main theorem, it is of more interest to find subspace
pairs in which the model subspace $\mathcal{M}$ is "small", so that the orthogonal complement $\mathcal{M}^\perp$ is "large".
To formalize this intuition, let us define the projection operator

$$\Pi_{\mathcal{M}}(u) := \arg\min_{v \in \mathcal{M}} \|u - v\|, \tag{4}$$

with the projection $\Pi_{\mathcal{M}^\perp}$ defined in an analogous manner. To simplify notation, we frequently use
the shorthand $u_{\mathcal{M}} = \Pi_{\mathcal{M}}(u)$ and $u_{\mathcal{M}^\perp} = \Pi_{\mathcal{M}^\perp}(u)$.
Of interest to us is the action of these projection operators on the unknown parameter $\theta^* \in \mathbb{R}^p$.
In the most desirable setting, the model subspace $\mathcal{M}$ can be chosen such that $\theta^*_{\mathcal{M}} \approx \theta^*$, or
equivalently, such that $\theta^*_{\mathcal{M}^\perp} \approx 0$. If this can be achieved with the model subspace $\mathcal{M}$ remaining relatively
small, then our main theorem guarantees that it is possible to estimate $\theta^*$ at a relatively fast rate.
The following examples illustrate suitable choices of the spaces $\mathcal{M}$ and $\overline{\mathcal{M}}$ in three concrete settings,
beginning with the case of sparse vectors.

Example 1. Sparse vectors and $\ell_1$-norm regularization. Suppose the error norm $\|\cdot\|$ is the usual
$\ell_2$-norm, and that the model class of interest is the set of $s$-sparse vectors in $p$ dimensions. For any
particular subset $S \subseteq \{1, 2, \ldots, p\}$ with cardinality $s$, we define the model subspace

$$\mathcal{M}(S) := \{ \theta \in \mathbb{R}^p \mid \theta_j = 0 \text{ for all } j \notin S \}. \tag{5}$$

Here our notation reflects the fact that $\mathcal{M}$ depends explicitly on the chosen subset $S$. By
construction, we have $\Pi_{\mathcal{M}(S)}(\theta^*) = \theta^*$ for any vector $\theta^*$ that is supported on $S$.

In this case, we may define $\overline{\mathcal{M}}(S) = \mathcal{M}(S)$, and note that the orthogonal complement with
respect to the Euclidean inner product is given by

$$\overline{\mathcal{M}}^\perp(S) = \mathcal{M}^\perp(S) = \{ \gamma \in \mathbb{R}^p \mid \gamma_j = 0 \text{ for all } j \in S \}. \tag{6}$$

This set corresponds to the perturbation subspace, capturing deviations away from the set of vectors
with support $S$. We claim that for any subset $S$, the $\ell_1$-norm $\mathcal{R}(\theta) = \|\theta\|_1$ is decomposable with
respect to the pair $(\mathcal{M}(S), \overline{\mathcal{M}}^\perp(S))$. Indeed, by construction of the subspaces, any $\theta \in \mathcal{M}(S)$ can
be written in the partitioned form $\theta = (\theta_S, 0_{S^c})$, where $\theta_S \in \mathbb{R}^s$ and $0_{S^c} \in \mathbb{R}^{p-s}$ is a vector of zeros.
Similarly, any vector $\gamma \in \overline{\mathcal{M}}^\perp(S)$ has the partitioned representation $(0_S, \gamma_{S^c})$. Putting together the
pieces, we obtain

$$\|\theta + \gamma\|_1 = \|(\theta_S, 0) + (0, \gamma_{S^c})\|_1 = \|\theta\|_1 + \|\gamma\|_1,$$

showing that the $\ell_1$-norm is decomposable as claimed. $\diamond$
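
The decomposability argument of Example 1 can be checked numerically on small vectors. The sketch below is only an illustration under stated assumptions: coordinates are indexed from 0, and the helper names (`l1_norm`, `project_onto_support`, `project_off_support`) are hypothetical rather than taken from the paper.

```python
import math


def l1_norm(v: list[float]) -> float:
    """The regularizer R(v) = ||v||_1."""
    return sum(abs(a) for a in v)


def project_onto_support(v: list[float], S: set[int]) -> list[float]:
    """Projection onto M(S): keep the coordinates in S, zero out the rest."""
    return [a if j in S else 0.0 for j, a in enumerate(v)]


def project_off_support(v: list[float], S: set[int]) -> list[float]:
    """Projection onto the orthogonal complement of M(S): zero out coordinates in S."""
    return [0.0 if j in S else a for j, a in enumerate(v)]


S = {0, 1}                         # support set (0-based indices)
theta = [1.5, -2.0, 0.0, 0.0]      # lies in M(S)
gamma = [0.0, 0.0, 0.5, -3.0]      # lies in the orthogonal complement of M(S)
total = [t + g for t, g in zip(theta, gamma)]

# Triangle inequality holds with equality for this pair (decomposability).
tight = math.isclose(l1_norm(total), l1_norm(theta) + l1_norm(gamma))

# A perturbation that overlaps the support generally gives a strict inequality.
overlap = [-1.5, 0.0, 0.0, 0.0]
slack = l1_norm(theta) + l1_norm(overlap) - l1_norm([t + o for t, o in zip(theta, overlap)])
```

The second computation shows why the choice of subspace pair matters: when the perturbation shares coordinates with $\theta$, cancellation can occur and the triangle inequality is no longer tight.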

As a follow-up to the previous example, it is also worth noting that the same argument shows
that for a strictly positive weight vector $\omega$, the weighted $\ell_1$-norm $\mathcal{R}_\omega(\theta) := \sum_{j=1}^p \omega_j |\theta_j|$ is also
decomposable with respect to the pair $(\mathcal{M}(S), \overline{\mathcal{M}}^\perp(S))$. For another natural extension, we now turn to
the case of sparsity models with more structure.

Example 2. Group-structured norms. In many applications, sparsity arises in a more
structured fashion, with groups of coefficients likely to be zero (or non-zero) simultaneously. In
order to model this behavior, suppose that the index set $\{1, 2, \ldots, p\}$ can be partitioned into a
set of $N_{\mathcal{G}}$ disjoint groups, say $\mathcal{G} = \{G_1, G_2, \ldots, G_{N_{\mathcal{G}}}\}$. With this set-up, for a given vector
$\vec{\alpha} = (\alpha_1, \ldots, \alpha_{N_{\mathcal{G}}}) \in [1, \infty]^{N_{\mathcal{G}}}$, the associated $(1, \vec{\alpha})$-group norm takes the form

$$\|\theta\|_{\mathcal{G}, \vec{\alpha}} := \sum_{t=1}^{N_{\mathcal{G}}} \|\theta_{G_t}\|_{\alpha_t}. \tag{7}$$

For instance, with the choice $\vec{\alpha} = (2, 2, \ldots, 2)$, we obtain the group $\ell_1/\ell_2$-norm, corresponding to
the regularizer that underlies the group Lasso [79]. On the other hand, the choice $\vec{\alpha} = (\infty, \ldots, \infty)$,
corresponding to a form of block $\ell_1/\ell_\infty$ regularization, has also been studied in past work [?, 53, 81].
Note that for $\vec{\alpha} = (1, 1, \ldots, 1)$, we obtain the standard $\ell_1$ penalty. Interestingly, our analysis shows
that setting $\vec{\alpha} \in [2, \infty]^{N_{\mathcal{G}}}$ can often lead to superior statistical performance.

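We now show that the norm $\|\cdot\|_{\mathcal{G}, \vec{\alpha}}$ is again decomposable with respect to appropriately
defined subspaces. Indeed, given any subset $S_{\mathcal{G}} \subseteq \{1, \ldots, N_{\mathcal{G}}\}$ of group indices, say with cardinality
$s_{\mathcal{G}} = |S_{\mathcal{G}}|$, we can define the subspace

$$\mathcal{M}(S_{\mathcal{G}}) := \{ \theta \in \mathbb{R}^p \mid \theta_{G_t} = 0 \text{ for all } t \notin S_{\mathcal{G}} \}, \tag{8}$$

as well as its orthogonal complement with respect to the usual Euclidean inner product

$$\overline{\mathcal{M}}^\perp(S_{\mathcal{G}}) = \mathcal{M}^\perp(S_{\mathcal{G}}) := \{ \theta \in \mathbb{R}^p \mid \theta_{G_t} = 0 \text{ for all } t \in S_{\mathcal{G}} \}. \tag{9}$$

With these definitions, for any pair of vectors $\theta \in \mathcal{M}(S_{\mathcal{G}})$ and $\gamma \in \overline{\mathcal{M}}^\perp(S_{\mathcal{G}})$, we have

$$\|\theta + \gamma\|_{\mathcal{G}, \vec{\alpha}} = \sum_{t \in S_{\mathcal{G}}} \|\theta_{G_t} + 0_{G_t}\|_{\alpha_t} + \sum_{t \notin S_{\mathcal{G}}} \|0_{G_t} + \gamma_{G_t}\|_{\alpha_t} = \|\theta\|_{\mathcal{G}, \vec{\alpha}} + \|\gamma\|_{\mathcal{G}, \vec{\alpha}}, \tag{10}$$

thus verifying the decomposability condition. $\diamond$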
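
As a small numerical companion to Example 2, the following sketch evaluates the group norm (7) for a given partition and exponent vector, and checks the decomposition (10) on one pair of vectors. The names `p_norm` and `group_norm`, the 0-based indexing, and the particular groups are assumptions made for illustration.

```python
import math


def p_norm(v: list[float], a: float) -> float:
    """The l_a norm of v for an exponent a in [1, inf]."""
    if a == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** a for x in v) ** (1.0 / a)


def group_norm(theta: list[float], groups: list[list[int]], alpha: list[float]) -> float:
    """Sum over groups t of the l_{alpha_t} norm of theta restricted to group t."""
    return sum(p_norm([theta[j] for j in G], a) for G, a in zip(groups, alpha))


groups = [[0, 1], [2, 3], [4]]        # a partition of {0, ..., 4} into three groups
alpha = [2.0, 2.0, 2.0]               # group l_1/l_2 norm (group Lasso regularizer)

theta = [3.0, 4.0, 0.0, 0.0, 0.0]     # active only on group 0, so S_G = {0}
gamma = [0.0, 0.0, 0.0, 0.0, -2.0]    # active only on groups outside S_G
total = [t + g for t, g in zip(theta, gamma)]

lhs = group_norm(total, groups, alpha)
rhs = group_norm(theta, groups, alpha) + group_norm(gamma, groups, alpha)
decomposes = math.isclose(lhs, rhs)
```

Choosing all exponents equal to 1 recovers the ordinary $\ell_1$-norm, and choosing them equal to `math.inf` gives the block $\ell_1/\ell_\infty$ norm mentioned in the example.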

In the preceding example, we exploited the fact that the groups were non-overlapping in order to
establish the decomposability property. Therefore, some modifications would be required in order to
choose the subspaces appropriately for overlapping group regularizers proposed in past work [?, 30].



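Example 3. Low-rank matrices and nuclear norm. Now suppose that each parameter $\Theta \in \mathbb{R}^{p_1 \times p_2}$
is a matrix; this corresponds to an instance of our general set-up with $p = p_1 p_2$, as long as we
identify the space $\mathbb{R}^{p_1 \times p_2}$ with $\mathbb{R}^{p_1 p_2}$ in the usual way. We equip this space with the inner product
$\langle\langle \Theta, \Gamma \rangle\rangle := \operatorname{trace}(\Theta \Gamma^T)$, a choice which yields (as the induced norm) the Frobenius norm

$$|\!|\!|\Theta|\!|\!|_F := \sqrt{\langle\langle \Theta, \Theta \rangle\rangle} = \sqrt{\sum_{j=1}^{p_1} \sum_{k=1}^{p_2} \Theta_{jk}^2}. \tag{11}$$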

In many settings, it is natural to consider estimating matrices that are low-rank; examples include
principal component analysis, spectral clustering, collaborative filtering, and matrix completion.
With certain exceptions, it is computationally expensive to enforce a rank-constraint in a direct
manner, so that a variety of researchers have studied the nuclear norm, also known as the trace
norm, as a surrogate for a rank constraint. More precisely, the nuclear norm is given by

$$|\!|\!|\Theta|\!|\!|_{\mathrm{nuc}} := \sum_{j=1}^{\min\{p_1, p_2\}} \sigma_j(\Theta), \tag{12}$$

where $\{\sigma_j(\Theta)\}$ are the singular values of the matrix $\Theta$.

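The nuclear norm is decomposable with respect to appropriately chosen subspaces. Let us
consider the class of matrices $\Theta \in \mathbb{R}^{p_1 \times p_2}$ that have rank $r \le \min\{p_1, p_2\}$. For any given matrix $\Theta$,
we let $\operatorname{row}(\Theta) \subseteq \mathbb{R}^{p_2}$ and $\operatorname{col}(\Theta) \subseteq \mathbb{R}^{p_1}$ denote its row space and column space respectively. Let
$U$ and $V$ be a given pair of $r$-dimensional subspaces $U \subseteq \mathbb{R}^{p_1}$ and $V \subseteq \mathbb{R}^{p_2}$; these subspaces will
represent left and right singular vectors of the target matrix $\Theta^*$ to be estimated. For a given pair
$(U, V)$, we can define the subspaces $\mathcal{M}(U, V)$ and $\overline{\mathcal{M}}^\perp(U, V)$ of $\mathbb{R}^{p_1 \times p_2}$ given by

$$\mathcal{M}(U, V) := \{ \Theta \in \mathbb{R}^{p_1 \times p_2} \mid \operatorname{row}(\Theta) \subseteq V, \ \operatorname{col}(\Theta) \subseteq U \}, \quad \text{and} \tag{13a}$$
$$\overline{\mathcal{M}}^\perp(U, V) := \{ \Theta \in \mathbb{R}^{p_1 \times p_2} \mid \operatorname{row}(\Theta) \subseteq V^\perp, \ \operatorname{col}(\Theta) \subseteq U^\perp \}. \tag{13b}$$

So as to simplify notation, we omit the indices $(U, V)$ when they are clear from context. Unlike the
preceding examples, in this case the set $\overline{\mathcal{M}}$ is not equal to $\mathcal{M}$.²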

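Finally, we claim that the nuclear norm is decomposable with respect to the pair $(\mathcal{M}, \overline{\mathcal{M}}^\perp)$. By
construction, any pair of matrices $\Theta \in \mathcal{M}$ and $\Gamma \in \overline{\mathcal{M}}^\perp$ have orthogonal row and column spaces,
which implies the required decomposability condition, namely
$|\!|\!|\Theta + \Gamma|\!|\!|_{\mathrm{nuc}} = |\!|\!|\Theta|\!|\!|_{\mathrm{nuc}} + |\!|\!|\Gamma|\!|\!|_{\mathrm{nuc}}$.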

A line of recent work (e.g., [19, 27, 2, ?, ?, ?]) has studied matrix problems involving the sum
of a low-rank matrix with a sparse matrix, along with the regularizer formed by a weighted sum of
the nuclear norm and the elementwise $\ell_1$-norm. By a combination of Example 1 and Example 3,
this regularizer also satisfies the decomposability property with respect to appropriately defined
subspaces.



2.3 A key consequence of decomposability 

Thus far, we have specified a class (1) of M-estimators based on regularization, defined the notion of
decomposability for the regularizer and worked through several illustrative examples. We now turn
to the statistical consequences of decomposability, and more specifically to its implications for the error
vector $\hat{\Delta}_{\lambda_n} = \hat{\theta}_{\lambda_n} - \theta^*$, where $\hat{\theta}_{\lambda_n} \in \mathbb{R}^p$ is any solution of the regularized M-estimation procedure (1).
For a given inner product $\langle \cdot, \cdot \rangle$, the dual norm of $\mathcal{R}$ is given by

$$\mathcal{R}^*(v) := \sup_{u \in \mathbb{R}^p \setminus \{0\}} \frac{\langle u, v \rangle}{\mathcal{R}(u)} = \sup_{\mathcal{R}(u) \le 1} \langle u, v \rangle. \tag{14}$$

This notion is best understood by working through some examples.

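² However, as is required by our theory, we do have the inclusion $\mathcal{M} \subseteq \overline{\mathcal{M}}$. Indeed, given any $\Theta \in \mathcal{M}$ and $\Gamma \in \overline{\mathcal{M}}^\perp$,
we have $\Theta^T \Gamma = 0$ by definition, which implies that $\langle\langle \Theta, \Gamma \rangle\rangle = \operatorname{trace}(\Theta^T \Gamma) = 0$. Since $\Gamma \in \overline{\mathcal{M}}^\perp$ was arbitrary, we have
shown that $\Theta$ is orthogonal to the space $\overline{\mathcal{M}}^\perp$, meaning that it must belong to $\overline{\mathcal{M}}$.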

Dual of $\ell_1$-norm: For the $\ell_1$-norm $\mathcal{R}(u) = \|u\|_1$ previously discussed in Example 1, let us compute
its dual norm with respect to the Euclidean inner product on $\mathbb{R}^p$. For any vector $v \in \mathbb{R}^p$, we have

$$\sup_{\|u\|_1 \le 1} \langle u, v \rangle \le \sup_{\|u\|_1 \le 1} \sum_{k=1}^p |u_k| |v_k| \le \sup_{\|u\|_1 \le 1} \Big( \sum_{k=1}^p |u_k| \Big) \max_{k=1,\ldots,p} |v_k| = \|v\|_\infty.$$

We claim that this upper bound actually holds with equality. In particular, letting $j$ be any index
for which $|v_j|$ achieves the maximum $\|v\|_\infty = \max_{k=1,\ldots,p} |v_k|$, suppose that we form a vector $\tilde{u} \in \mathbb{R}^p$
with $\tilde{u}_j = \operatorname{sign}(v_j)$, and $\tilde{u}_k = 0$ for all $k \ne j$. With this choice, we have $\|\tilde{u}\|_1 \le 1$, and hence
$\sup_{\|u\|_1 \le 1} \langle u, v \rangle \ge \sum_{k=1}^p \tilde{u}_k v_k = \|v\|_\infty$, showing that the dual of the $\ell_1$-norm is the $\ell_\infty$-norm.
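
The two-step argument above (an upper bound, then a vector attaining it) can be mirrored in a few lines of code. This is a minimal sketch with hypothetical helper names (`dual_of_l1`, `l1_ball_maximizer`, `inner`); it uses `math.copysign`, so for a zero entry the sign is taken to be $+1$, which does not affect the value of the inner product.

```python
import math


def inner(u: list[float], v: list[float]) -> float:
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def dual_of_l1(v: list[float]) -> float:
    """Dual norm of the l_1 norm, i.e. the l_infinity norm of v."""
    return max(abs(a) for a in v)


def l1_ball_maximizer(v: list[float]) -> list[float]:
    """A point of the unit l_1 ball attaining sup <u, v>: a signed coordinate vector."""
    j = max(range(len(v)), key=lambda k: abs(v[k]))  # index of the largest |v_k|
    u = [0.0] * len(v)
    u[j] = math.copysign(1.0, v[j])
    return u


v = [0.5, -3.0, 2.0]
u_tilde = l1_ball_maximizer(v)
attained = inner(u_tilde, v)          # equals the l_infinity norm of v

# Any other feasible point, e.g. the uniform vector on the l_1 sphere, does no better.
u_uniform = [1.0 / len(v)] * len(v)
not_better = inner(u_uniform, v) <= dual_of_l1(v)
```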

Dual of group norm: Now recall the group norm from Example 2, specified in terms of a vector
$\vec{\alpha} \in [2, \infty]^{N_{\mathcal{G}}}$. A similar calculation shows that its dual norm, again with respect to the Euclidean
inner product on $\mathbb{R}^p$, is given by

$$\|v\|_{\mathcal{G}, \vec{\alpha}^*} = \max_{t=1,\ldots,N_{\mathcal{G}}} \|v_{G_t}\|_{\alpha_t^*} \quad \text{where } \frac{1}{\alpha_t} + \frac{1}{\alpha_t^*} = 1 \text{ are dual exponents.} \tag{15}$$

As special cases of this general duality relation, the block $(1, 2)$ norm that underlies the usual group
Lasso leads to a block $(\infty, 2)$ norm as the dual, whereas the block $(1, \infty)$ norm leads to a block
$(\infty, 1)$ norm as the dual.



Dual of nuclear norm: For the nuclear norm, the dual is defined with respect to the trace inner
product on the space of matrices. For any matrix $N \in \mathbb{R}^{p_1 \times p_2}$, it can be shown that

$$\mathcal{R}^*(N) = \sup_{|\!|\!|M|\!|\!|_{\mathrm{nuc}} \le 1} \langle\langle M, N \rangle\rangle = |\!|\!|N|\!|\!|_{\mathrm{op}} = \max_{j=1,\ldots,\min\{p_1, p_2\}} \sigma_j(N),$$

corresponding to the $\ell_\infty$-norm applied to the vector $\sigma(N)$ of singular values. In the special case of
diagonal matrices, this fact reduces to the dual relationship between the vector $\ell_1$ and $\ell_\infty$ norms.

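The dual norm plays a key role in our general theory, in particular by specifying a suitable choice
of the regularization weight $\lambda_n$. We summarize in the following:

Lemma 1. Suppose that $\mathcal{L}$ is a convex and differentiable function, and consider any optimal solution
$\hat{\theta}$ to the optimization problem (1) with a strictly positive regularization parameter satisfying

$$\lambda_n \ge 2 \mathcal{R}^*(\nabla \mathcal{L}(\theta^*; Z_1^n)). \tag{16}$$

Then for any pair $(\mathcal{M}, \overline{\mathcal{M}}^\perp)$ over which $\mathcal{R}$ is decomposable, the error $\hat{\Delta} = \hat{\theta}_{\lambda_n} - \theta^*$ belongs to the
set

$$\mathbb{C}(\mathcal{M}, \overline{\mathcal{M}}^\perp; \theta^*) := \big\{ \Delta \in \mathbb{R}^p \mid \mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp}) \le 3 \mathcal{R}(\Delta_{\overline{\mathcal{M}}}) + 4 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) \big\}. \tag{17}$$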

We prove this result in Appendix A.1. It has the following important consequence: for any
decomposable regularizer and an appropriate choice (16) of regularization parameter, we are guaranteed
that the error vector $\hat{\Delta}$ belongs to a very specific set, depending on the unknown vector $\theta^*$. As
illustrated in Figure 1, the geometry of the set $\mathbb{C}$ depends on the relation between $\theta^*$ and the model
subspace $\mathcal{M}$. When $\theta^* \in \mathcal{M}$, then we are guaranteed that $\mathcal{R}(\theta^*_{\mathcal{M}^\perp}) = 0$. In this case, the
constraint (17) reduces to $\mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp}) \le 3 \mathcal{R}(\Delta_{\overline{\mathcal{M}}})$, so that $\mathbb{C}$ is a cone, as illustrated in panel (a). In
the more general case when $\theta^* \notin \mathcal{M}$ so that $\mathcal{R}(\theta^*_{\mathcal{M}^\perp}) \ne 0$, the set $\mathbb{C}$ is not a cone, but rather a
star-shaped set (panel (b)). As will be clarified in the sequel, the case $\theta^* \notin \mathcal{M}$ requires a more
delicate treatment.
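
For the $\ell_1$-norm with $\overline{\mathcal{M}}(S) = \mathcal{M}(S)$ as in Example 1, membership in the set (17) reduces to comparing sums of absolute values on and off the support $S$. The sketch below checks this condition directly; the function name `in_constraint_set`, the 0-based indexing, and the test vectors are illustrative assumptions, not part of the paper.

```python
def in_constraint_set(delta: list[float], theta_star: list[float], S: set[int]) -> bool:
    """Check ||delta_{S^c}||_1 <= 3 * ||delta_S||_1 + 4 * ||theta*_{S^c}||_1."""
    off_support = sum(abs(d) for j, d in enumerate(delta) if j not in S)
    on_support = sum(abs(d) for j, d in enumerate(delta) if j in S)
    tail = sum(abs(t) for j, t in enumerate(theta_star) if j not in S)
    return off_support <= 3.0 * on_support + 4.0 * tail


S = {2}                                   # model subspace: only the third coordinate is free

# Case (a): theta* lies in M(S), so the set is a cone.
theta_in_model = [0.0, 0.0, 1.0]
inside_cone = in_constraint_set([1.0, 1.0, 1.0], theta_in_model, S)     # 2 <= 3
outside_cone = in_constraint_set([2.0, 2.0, 1.0], theta_in_model, S)    # 4 <= 3 fails

# Case (b): theta* has a small component outside M(S), which enlarges the set.
theta_off_model = [0.25, 0.0, 1.0]
now_inside = in_constraint_set([2.0, 2.0, 1.0], theta_off_model, S)     # 4 <= 3 + 1

# A pure off-support perturbation is excluded in case (a) but allowed in case (b).
pure_off = [0.5, 0.5, 0.0]
excluded_a = not in_constraint_set(pure_off, theta_in_model, S)
allowed_b = in_constraint_set(pure_off, theta_off_model, S)
```

The last two checks illustrate the difference between the two cases: once $\theta^* \notin \mathcal{M}$, the set admits small perturbations in every direction, including those lying entirely in the perturbation subspace.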




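Figure 1. Illustration of the set $\mathbb{C}(\mathcal{M}, \overline{\mathcal{M}}^\perp; \theta^*)$ in the special case $\Delta = (\Delta_1, \Delta_2, \Delta_3) \in \mathbb{R}^3$ and
regularizer $\mathcal{R}(\Delta) = \|\Delta\|_1$, relevant for sparse vectors (Example 1). This picture shows the case
$S = \{3\}$, so that the model subspace is $\mathcal{M}(S) = \{\Delta \in \mathbb{R}^3 \mid \Delta_1 = \Delta_2 = 0\}$, and its orthogonal
complement is given by $\mathcal{M}^\perp(S) = \{\Delta \in \mathbb{R}^3 \mid \Delta_3 = 0\}$. (a) In the special case when $\theta_1^* = \theta_2^* = 0$,
so that $\theta^* \in \mathcal{M}$, the set $\mathbb{C}(\mathcal{M}, \overline{\mathcal{M}}^\perp; \theta^*)$ is a cone. (b) When $\theta^*$ does not belong to $\mathcal{M}$, the set
$\mathbb{C}(\mathcal{M}, \overline{\mathcal{M}}^\perp; \theta^*)$ is enlarged in the coordinates $(\Delta_1, \Delta_2)$ that span $\mathcal{M}^\perp$. It is no longer a cone, but is
still a star-shaped set.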



2.4 Restricted strong convexity 

We now turn to an important requirement of the loss function, and its interaction with the statistical
model. Recall that $\hat{\Delta} = \hat{\theta}_{\lambda_n} - \theta^*$ is the difference between an optimal solution $\hat{\theta}_{\lambda_n}$ and the true
parameter, and consider the loss difference³ $\mathcal{L}(\hat{\theta}_{\lambda_n}) - \mathcal{L}(\theta^*)$. In the classical setting, under fairly
mild conditions, one expects that the loss difference should converge to zero as the sample size
$n$ increases. It is important to note, however, that such convergence on its own is not sufficient to
guarantee that $\hat{\theta}_{\lambda_n}$ and $\theta^*$ are close, or equivalently that $\hat{\Delta}$ is small. Rather, the closeness depends
on the curvature of the loss function, as illustrated in Figure 2. In a desirable setting (panel (a)),
the loss function is sharply curved around its optimum $\hat{\theta}_{\lambda_n}$, so that having a small loss difference
$|\mathcal{L}(\theta^*) - \mathcal{L}(\hat{\theta}_{\lambda_n})|$ translates to a small error $\hat{\Delta} = \hat{\theta}_{\lambda_n} - \theta^*$. Panel (b) illustrates a less desirable
setting, in which the loss function is relatively flat, so that the loss difference can be small while the
error $\hat{\Delta}$ is relatively large.

The standard way to ensure that a function is "not too flat" is via the notion of strong convexity. 

³ To simplify notation, we frequently write $\mathcal{L}(\theta)$ as shorthand for $\mathcal{L}(\theta; Z_1^n)$ when the underlying data $Z_1^n$ is clear
from context.

Figure 2. Role of curvature in distinguishing parameters. (a) Loss function has high curvature around
$\hat{\Delta}$. A small excess loss $d\mathcal{L} = |\mathcal{L}(\hat{\theta}_{\lambda_n}) - \mathcal{L}(\theta^*)|$ guarantees that the parameter error $\hat{\Delta} = \hat{\theta}_{\lambda_n} - \theta^*$ is
also small. (b) A less desirable setting, in which the loss function has relatively low curvature around
the optimum.

Since $\mathcal{L}$ is differentiable by assumption, we may perform a first-order Taylor series expansion at $\theta^*$,
and in some direction $\Delta$; the error in this Taylor series is given by

$$\delta\mathcal{L}(\Delta, \theta^*) := \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) - \langle \nabla \mathcal{L}(\theta^*), \Delta \rangle. \tag{18}$$

One way in which to enforce that $\mathcal{L}$ is strongly convex is to require the existence of some positive
constant $\kappa > 0$ such that $\delta\mathcal{L}(\Delta, \theta^*) \ge \kappa \|\Delta\|^2$ for all $\Delta \in \mathbb{R}^p$ in a neighborhood of $\theta^*$. When the loss
function is twice differentiable, strong convexity amounts to a lower bound on the eigenvalues of the
Hessian $\nabla^2 \mathcal{L}(\theta)$, holding uniformly for all $\theta$ in a neighborhood of $\theta^*$.

Under classical "fixed $p$, large $n$" scaling, the loss function will be strongly convex under mild
conditions. For instance, suppose that the population risk $\overline{\mathcal{L}}$ is strongly convex, or equivalently, that
the Hessian $\nabla^2 \overline{\mathcal{L}}(\theta)$ is strictly positive definite in a neighborhood of $\theta^*$. As a concrete example,
when the loss function $\mathcal{L}$ is defined based on negative log likelihood of a statistical model, then the
Hessian $\nabla^2 \overline{\mathcal{L}}(\theta)$ corresponds to the Fisher information matrix, a quantity which arises naturally in
asymptotic statistics. If the dimension $p$ is fixed while the sample size $n$ goes to infinity, standard
arguments can be used to show that (under mild regularity conditions) the random Hessian $\nabla^2 \mathcal{L}(\theta)$
converges to $\nabla^2 \overline{\mathcal{L}}(\theta)$ uniformly for all $\theta$ in an open neighborhood of $\theta^*$. In contrast, whenever
the pair $(n, p)$ both increase in such a way that $p > n$, the situation is drastically different: the
Hessian matrix $\nabla^2 \mathcal{L}(\theta)$ is often singular. As a concrete example, consider linear regression based on
samples $Z_i = (y_i, x_i) \in \mathbb{R} \times \mathbb{R}^p$, for $i = 1, 2, \ldots, n$. Using the least-squares loss $\mathcal{L}(\theta) = \frac{1}{2n} \|y - X\theta\|_2^2$,
the $p \times p$ Hessian matrix $\nabla^2 \mathcal{L}(\theta) = \frac{1}{n} X^T X$ has rank at most $n$, meaning that the loss cannot be
strongly convex when $p > n$. Consequently, it is impossible to guarantee global strong convexity, so
that we need to restrict the set of directions $\Delta$ in which we require a curvature condition.
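
The loss of curvature when $p > n$ can be seen on a tiny least-squares problem. The following sketch evaluates the Taylor error (18) for the loss $\frac{1}{2n}\|y - X\theta\|_2^2$ with $n = 2$ and $p = 3$; the helper names and the particular design matrix are assumptions chosen so that a null-space direction is easy to write down by hand.

```python
def residuals(X: list[list[float]], y: list[float], theta: list[float]) -> list[float]:
    """Entries y_i - <x_i, theta> for each row x_i of X."""
    return [yi - sum(xij * tj for xij, tj in zip(row, theta)) for row, yi in zip(X, y)]


def loss(X: list[list[float]], y: list[float], theta: list[float]) -> float:
    """Least-squares loss (1/(2n)) * ||y - X theta||_2^2."""
    r = residuals(X, y, theta)
    return sum(ri * ri for ri in r) / (2 * len(y))


def gradient(X: list[list[float]], y: list[float], theta: list[float]) -> list[float]:
    """Gradient -(1/n) * X^T (y - X theta)."""
    r = residuals(X, y, theta)
    n, p = len(y), len(theta)
    return [-sum(X[i][j] * r[i] for i in range(n)) / n for j in range(p)]


def taylor_error(X, y, theta, delta) -> float:
    """First-order Taylor error: L(theta + delta) - L(theta) - <grad L(theta), delta>."""
    shifted = [t + d for t, d in zip(theta, delta)]
    linear = sum(g * d for g, d in zip(gradient(X, y, theta), delta))
    return loss(X, y, shifted) - loss(X, y, theta) - linear


X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]              # n = 2 samples, p = 3 covariates
y = [1.0, 2.0]
theta0 = [0.0, 0.0, 0.0]

flat = taylor_error(X, y, theta0, [1.0, 1.0, -1.0])    # direction in the null space of X
curved = taylor_error(X, y, theta0, [1.0, 0.0, 0.0])   # direction with X delta != 0
```

Along the null-space direction the Taylor error vanishes even though the direction has non-zero norm, so no bound of the form $\delta\mathcal{L}(\Delta, \theta^*) \ge \kappa \|\Delta\|^2$ with $\kappa > 0$ can hold for all $\Delta$.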

Ultimately, the only direction of interest is given by the error vector $\hat{\Delta} = \hat{\theta}_{\lambda_n} - \theta^*$. Recall that
Lemma 1 guarantees that, for suitable choices of the regularization parameter $\lambda_n$, this error vector
must belong to the set $\mathbb{C}(\mathcal{M}, \overline{\mathcal{M}}^\perp; \theta^*)$, as previously defined (17). Consequently, it suffices to ensure
that the function is strongly convex over this set, as formalized in the following:

Figure 3. (a) Illustration of a generic loss function in the high-dimensional $p > n$ setting: it is
curved in certain directions, but completely flat in others. (b) When $\theta^* \notin \mathcal{M}$, the set $\mathbb{C}(\mathcal{M}, \overline{\mathcal{M}}^\perp; \theta^*)$
contains a ball centered at the origin, which necessitates a tolerance term $\tau_{\mathcal{L}}(\theta^*) > 0$ in the definition
of restricted strong convexity.



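Definition 2. The loss function satisfies a restricted strong convexity (RSC) condition with
curvature $\kappa_{\mathcal{L}} > 0$ and tolerance function $\tau_{\mathcal{L}}$ if

$$\delta\mathcal{L}(\Delta, \theta^*) \ge \kappa_{\mathcal{L}} \|\Delta\|^2 - \tau_{\mathcal{L}}^2(\theta^*) \quad \text{for all } \Delta \in \mathbb{C}(\mathcal{M}, \overline{\mathcal{M}}^\perp; \theta^*). \tag{19}$$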

In the simplest of cases (in particular, when $\theta^* \in \mathcal{M}$) there are many statistical models for which
this RSC condition holds with tolerance $\tau_{\mathcal{L}}(\theta^*) = 0$. In the more general setting, it can hold only
with a non-zero tolerance term, as illustrated in Figure 3(b). As our proofs will clarify, we in fact
require only the lower bound (19) to hold for the intersection of $\mathbb{C}$ with a local ball $\{\|\Delta\| \le R\}$ of
some radius centered at zero. As will be clarified later, this restriction is not necessary for the
least-squares loss, but is essential for more general loss functions, such as those that arise in generalized
linear models.

We will see in the sequel that for many loss functions, it is possible to prove that with high
probability the first-order Taylor series error satisfies a lower bound of the form

$$\delta\mathcal{L}(\Delta, \theta^*) \geq \kappa_1 \|\Delta\|^2 - \kappa_2 \, g(n, p) \, \mathcal{R}^2(\Delta) \quad \text{for all } \|\Delta\| \leq 1, \tag{20}$$

where $\kappa_1, \kappa_2$ are positive constants, and $g(n, p)$ is a function of the sample size $n$ and ambient
dimension $p$, decreasing in the sample size. For instance, in the case of $\ell_1$-regularization, for covariates
with suitably controlled tails, this type of bound holds for the least-squares loss with the function
$g(n, p) = \frac{\log p}{n}$; see equation (31) to follow. For generalized linear models and the $\ell_1$-norm, a similar
type of bound is given in equation (43). We also provide a bound of this form for the least-squares
loss with group-structured norms in equation (46), with a different choice of the function $g$ depending on
the group structure.

A bound of the form (20) implies a form of restricted strong convexity as long as $\mathcal{R}(\Delta)$ is not
"too large" relative to $\|\Delta\|$. In order to formalize this notion, we define a quantity that relates the
error norm and the regularizer:

Definition 3 (Subspace compatibility constant). For any subspace $\mathcal{M}$ of $\mathbb{R}^p$, the subspace
compatibility constant with respect to the pair $(\mathcal{R}, \|\cdot\|)$ is given by

$$\Psi(\mathcal{M}) := \sup_{u \in \mathcal{M} \setminus \{0\}} \frac{\mathcal{R}(u)}{\|u\|}. \tag{21}$$

This quantity reflects the degree of compatibility between the regularizer and the error norm over
the subspace $\mathcal{M}$. In alternative terms, it is the Lipschitz constant of the regularizer with respect
to the error norm, restricted to the subspace $\mathcal{M}$. As a simple example, if $\mathcal{M}$ is an $s$-dimensional
co-ordinate subspace, with regularizer $\mathcal{R}(u) = \|u\|_1$ and error norm $\|u\| = \|u\|_2$, then we have
$\Psi(\mathcal{M}) = \sqrt{s}$.
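
The value $\sqrt{s}$ in this example can be checked numerically. The sketch below is only an illustration: the helper `l1_l2_ratio`, the dimension `s = 5`, and the random trial vectors are assumptions introduced here. It evaluates the ratio $\|u\|_1 / \|u\|_2$ at the all-ones vector, where the supremum is attained, and compares it with the largest ratio found over random vectors.

```python
import math
import random


def l1_l2_ratio(u):
    """Ratio ||u||_1 / ||u||_2 for a non-zero vector u."""
    l1 = sum(abs(v) for v in u)
    l2 = math.sqrt(sum(v * v for v in u))
    return l1 / l2


s = 5  # dimension of the coordinate subspace (illustrative value)

# The ratio equals sqrt(s) when all s coordinates have equal magnitude.
ratio_at_ones = l1_l2_ratio([1.0] * s)

# Random vectors in the subspace never exceed that value (Cauchy-Schwarz).
rng = random.Random(0)
max_random_ratio = max(
    l1_l2_ratio([rng.gauss(0.0, 1.0) for _ in range(s)]) for _ in range(1000)
)
```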

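This compatibility constant appears explicitly in the bounds of our main theorem, and also arises
in establishing restricted strong convexity. Let us now illustrate how it can be used to show that
the condition (20) implies a form of restricted strong convexity. To be concrete, let us suppose that
$\theta^*$ belongs to a subspace $\mathcal{M}$; in this case, membership of $\Delta$ in the set $\mathbb{C}(\mathcal{M}, \bar{\mathcal{M}}^\perp; \theta^*)$ implies that
$\mathcal{R}(\Delta_{\bar{\mathcal{M}}^\perp}) \leq 3 \mathcal{R}(\Delta_{\bar{\mathcal{M}}})$. Consequently, by the triangle inequality and the definition (21), we have

$$\mathcal{R}(\Delta) \leq \mathcal{R}(\Delta_{\bar{\mathcal{M}}^\perp}) + \mathcal{R}(\Delta_{\bar{\mathcal{M}}}) \leq 4 \mathcal{R}(\Delta_{\bar{\mathcal{M}}}) \leq 4 \Psi(\bar{\mathcal{M}}) \|\Delta\|.$$

Therefore, whenever a bound of the form (20) holds and $\theta^* \in \mathcal{M}$, we are guaranteed that

$$\delta\mathcal{L}(\Delta, \theta^*) \geq \big\{ \kappa_1 - 16 \kappa_2 \Psi^2(\bar{\mathcal{M}}) \, g(n, p) \big\} \|\Delta\|^2 \quad \text{for all } \|\Delta\| \leq 1.$$

Consequently, as long as the sample size is large enough that $16 \kappa_2 \Psi^2(\bar{\mathcal{M}}) \, g(n, p) < \frac{\kappa_1}{2}$, the restricted
strong convexity condition will hold with $\kappa_{\mathcal{L}} = \frac{\kappa_1}{2}$ and $\tau_{\mathcal{L}}(\theta^*) = 0$. We make use of arguments of
this flavor throughout this paper.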
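
To make the sample-size requirement concrete, the following sketch evaluates the curvature factor $\kappa_1 - 16 \kappa_2 \Psi^2 g(n, p)$ in the $\ell_1$ case, where $\Psi^2 = s$ and $g(n, p) = \log p / n$. The function names and the numerical values of $\kappa_1$, $\kappa_2$, $s$, $p$ and $n$ are illustrative assumptions, not quantities from the paper.

```python
import math


def restricted_curvature(kappa1, kappa2, psi_squared, g):
    """Curvature factor kappa1 - 16 * kappa2 * Psi^2 * g(n, p) from the argument above."""
    return kappa1 - 16.0 * kappa2 * psi_squared * g


def l1_restricted_curvature(kappa1, kappa2, s, p, n):
    """Specialization to the l1-norm: Psi^2 = s and g(n, p) = log(p) / n."""
    return restricted_curvature(kappa1, kappa2, float(s), math.log(p) / n)


# Illustrative constants (assumed): kappa1 = kappa2 = 1, sparsity s = 10, p = 1000.
large_n = l1_restricted_curvature(1.0, 1.0, s=10, p=1000, n=2500)
small_n = l1_restricted_curvature(1.0, 1.0, s=10, p=1000, n=1000)
```

With these values, `large_n` stays above $\kappa_1 / 2$, so RSC holds with zero tolerance, whereas `small_n` is negative and the argument gives no curvature guarantee.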



3 Bounds for general M-estimators 

We are now ready to state a general result that provides bounds and hence convergence rates for
the error $\|\hat{\theta}_{\lambda_n} - \theta^*\|$, where $\hat{\theta}_{\lambda_n}$ is any optimal solution of the convex program (1). Although it
may appear somewhat abstract at first sight, this result has a number of concrete and useful
consequences for specific models. In particular, we recover as an immediate corollary the best known
results about estimation in sparse linear models with general designs @, |4^|, as well as a number of
new results, including minimax-optimal rates for estimation under $\ell_q$-sparsity constraints and
estimation of block-structured sparse matrices. In results that we report elsewhere, we also apply these
theorems to establishing results for sparse generalized linear models [50], estimation of low-rank
matrices [51, 52], matrix decomposition problems [3], and sparse non-parametric regression models.

Let us recall our running assumptions on the structure of the convex program (1).

(G1) The regularizer $\mathcal{R}$ is a norm, and is decomposable with respect to the subspace pair $(\mathcal{M}, \bar{\mathcal{M}}^\perp)$,
where $\mathcal{M} \subseteq \bar{\mathcal{M}}$.

(G2) The loss function $\mathcal{L}$ is convex and differentiable, and satisfies restricted strong convexity with
curvature $\kappa_{\mathcal{L}}$ and tolerance $\tau_{\mathcal{L}}$.

The reader should also recall the definition (21) of the subspace compatibility constant. With this
notation, we can now state the main result of this paper:

Theorem 1 (Bounds for general models). Under conditions (G1) and (G2), consider the
problem (1) based on a strictly positive regularization constant $\lambda_n \geq 2 \mathcal{R}^*(\nabla \mathcal{L}(\theta^*))$. Then any optimal
solution $\hat{\theta}_{\lambda_n}$ to the convex program (1) satisfies the bound

$$\|\hat{\theta}_{\lambda_n} - \theta^*\|^2 \leq 9 \frac{\lambda_n^2}{\kappa_{\mathcal{L}}^2} \Psi^2(\bar{\mathcal{M}}) + \frac{\lambda_n}{\kappa_{\mathcal{L}}} \big\{ 2 \tau_{\mathcal{L}}^2(\theta^*) + 4 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) \big\}. \tag{22}$$
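
As a reading aid, the right-hand side of the bound (22) can be evaluated directly once its ingredients are known. The sketch below assumes the bound has the form $9 (\lambda_n^2 / \kappa_{\mathcal{L}}^2) \Psi^2(\bar{\mathcal{M}}) + (\lambda_n / \kappa_{\mathcal{L}}) \{ 2 \tau_{\mathcal{L}}^2(\theta^*) + 4 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) \}$; the function name, argument names, and numerical inputs are hypothetical.

```python
def theorem1_bound(lam, kappa, psi, tau_squared=0.0, approx_norm=0.0):
    """Evaluate the assumed form of the right-hand side of the bound (22).

    lam          regularization parameter lambda_n
    kappa        RSC curvature kappa_L
    psi          subspace compatibility constant Psi of the larger subspace
    tau_squared  squared RSC tolerance tau_L^2(theta*)
    approx_norm  regularizer applied to the component of theta* outside the model subspace
    """
    estimation_term = 9.0 * (lam ** 2) / (kappa ** 2) * (psi ** 2)
    remainder_term = (lam / kappa) * (2.0 * tau_squared + 4.0 * approx_norm)
    return estimation_term + remainder_term


# Illustrative inputs (assumed values, not from the paper).
exact_case = theorem1_bound(lam=0.1, kappa=0.5, psi=2.0)
approximate_case = theorem1_bound(lam=0.1, kappa=0.5, psi=2.0, approx_norm=1.0)
```

When $\theta^*$ lies in the model subspace and the tolerance is zero, only the first term survives, which is the situation covered by Corollary 1.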
Remarks: Let us consider in more detail some different features of this result.

(a) It should be noted that Theorem 1 is actually a deterministic statement about the set of
optimizers of the convex program (1) for a fixed choice of $\lambda_n$. Although the program is convex, it
need not be strictly convex, so that the global optimum might be attained at more than one point
$\hat{\theta}_{\lambda_n}$. The stated bound holds for any of these optima. Probabilistic analysis is required when
Theorem 1 is applied to particular statistical models, and we need to verify that the regularization
parameter satisfies the condition

$$\lambda_n \geq 2 \mathcal{R}^*(\nabla \mathcal{L}(\theta^*)), \tag{23}$$

and that the loss satisfies the RSC condition. A challenge here is that since $\theta^*$ is unknown, it is
usually impossible to compute the right-hand side of the condition (23). Instead, when we derive
consequences of Theorem 1 for different statistical models, we use concentration inequalities in
order to provide bounds that hold with high probability over the data.

(b) Second, note that Theorem 1 actually provides a family of bounds, one for each pair $(\mathcal{M}, \bar{\mathcal{M}}^\perp)$
of subspaces for which the regularizer is decomposable. Ignoring the term involving $\tau_{\mathcal{L}}$ for the
moment, for any given pair, the error bound is the sum of two terms, corresponding to estimation
error $\mathcal{E}_{\mathrm{err}}$ and approximation error $\mathcal{E}_{\mathrm{app}}$, given by (respectively)

$$\mathcal{E}_{\mathrm{err}} := 9 \frac{\lambda_n^2}{\kappa_{\mathcal{L}}^2} \Psi^2(\bar{\mathcal{M}}), \quad \text{and} \quad \mathcal{E}_{\mathrm{app}} := 4 \frac{\lambda_n}{\kappa_{\mathcal{L}}} \mathcal{R}(\theta^*_{\mathcal{M}^\perp}). \tag{24}$$

As the dimension of the subspace $\mathcal{M}$ increases (so that the dimension of $\mathcal{M}^\perp$ decreases), the
approximation error tends to zero. But since $\mathcal{M} \subseteq \bar{\mathcal{M}}$, the estimation error is increasing at the
same time. Thus, in the usual way, optimal rates are obtained by choosing $\mathcal{M}$ and $\bar{\mathcal{M}}$ so as to
balance these two contributions to the error. We illustrate such choices for various specific models
to follow.

(c) As will be clarified in the sequel, many high-dimensional statistical models have an unidentifiable
component, and the tolerance term $\tau_{\mathcal{L}}$ reflects the degree of this non-identifiability.

A large body of past work on sparse linear regression has focused on the case of exactly sparse
regression models for which the unknown regression vector $\theta^*$ is $s$-sparse. For this special case, recall
from Example 1 in Section 2.2 that we can define an $s$-dimensional subspace $\mathcal{M}$ that contains $\theta^*$.
Consequently, the associated set $\mathbb{C}(\mathcal{M}, \bar{\mathcal{M}}^\perp; \theta^*)$ is a cone (see Figure 1(a)), and it is thus possible to
establish that restricted strong convexity (RSC) holds with tolerance parameter $\tau_{\mathcal{L}}(\theta^*) = 0$. This
same reasoning applies to other statistical models, among them group-sparse regression, in which
a small subset of groups are active, as well as low-rank matrix estimation. The following corollary
provides a simply stated bound that covers all of these models:

Corollary 1. Suppose that, in addition to the conditions of Theorem 1, the unknown $\theta^*$ belongs to
$\mathcal{M}$ and the RSC condition holds over $\mathbb{C}(\mathcal{M}, \bar{\mathcal{M}}^\perp; \theta^*)$ with $\tau_{\mathcal{L}}(\theta^*) = 0$. Then any optimal solution $\hat{\theta}_{\lambda_n}$
to the convex program (1) satisfies the bounds

$$\|\hat{\theta}_{\lambda_n} - \theta^*\|^2 \leq 9 \frac{\lambda_n^2}{\kappa_{\mathcal{L}}^2} \Psi^2(\bar{\mathcal{M}}), \quad \text{and} \tag{25a}$$

$$\mathcal{R}(\hat{\theta}_{\lambda_n} - \theta^*) \leq 12 \frac{\lambda_n}{\kappa_{\mathcal{L}}} \Psi^2(\bar{\mathcal{M}}). \tag{25b}$$

Focusing first on the bound (25a), it consists of three terms, each of which has a natural
interpretation. First, it is inversely proportional to the RSC constant $\kappa_{\mathcal{L}}$, so that higher curvature guarantees
lower error, as is to be expected. Second, the error bound grows proportionally with the subspace
compatibility constant $\Psi(\bar{\mathcal{M}})$, which measures the compatibility between the regularizer $\mathcal{R}$ and error
norm $\|\cdot\|$ over the subspace $\bar{\mathcal{M}}$ (see Definition 3). This term increases with the size of the subspace $\bar{\mathcal{M}}$,
which contains the model subspace $\mathcal{M}$. Third, the bound also scales linearly with the regularization
parameter $\lambda_n$, which must be strictly positive and satisfy the lower bound (23). The bound (25b)
on the error measured in the regularizer norm is similar, except that it scales quadratically with
the subspace compatibility constant. As the proof clarifies, this additional dependence arises since
the regularizer over the subspace $\bar{\mathcal{M}}$ is larger than the norm $\|\cdot\|$ by a factor of at most $\Psi(\bar{\mathcal{M}})$ (see
Definition 3).

Obtaining concrete rates using Corollary 1 requires some work in order to verify the conditions
of Theorem 1 and to provide control on the three quantities in the bounds (25a) and (25b), as
illustrated in the examples to follow.



4 Convergence rates for sparse regression 

As an illustration, we begin with one of the simplest statistical models, namely the standard linear
model. It is based on $n$ observations $Z_i = (x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}$ of covariate-response pairs. Let $y \in \mathbb{R}^n$
denote a vector of the responses, and let $X \in \mathbb{R}^{n \times p}$ be the design matrix, where $x_i \in \mathbb{R}^p$ is the $i$th
row. This pair is linked via the linear model

$$y = X\theta^* + w, \tag{26}$$

where $\theta^* \in \mathbb{R}^p$ is the unknown regression vector, and $w \in \mathbb{R}^n$ is a noise vector. To begin, we focus
on this simple linear set-up, and describe extensions to generalized models in Section 4.4.

Given the data set $Z_1^n = (y, X) \in \mathbb{R}^n \times \mathbb{R}^{n \times p}$, our goal is to obtain a "good" estimate $\hat{\theta}$ of
the regression vector $\theta^*$, assessed either in terms of its $\ell_2$-error $\|\hat{\theta} - \theta^*\|_2$ or its $\ell_1$-error $\|\hat{\theta} - \theta^*\|_1$.
It is worth noting that whenever $p > n$, the standard linear model (26) is unidentifiable in a
certain sense, since the rectangular matrix $X \in \mathbb{R}^{n \times p}$ has a nullspace of dimension at least $p - n$.
Consequently, in order to obtain an identifiable model (or at the very least, to bound the degree of
non-identifiability), it is essential to impose additional constraints on the regression vector $\theta^*$. One
natural constraint is some type of sparsity in the regression vector; for instance, one might assume
that $\theta^*$ has at most $s$ non-zero coefficients, as discussed at more length in Section 4.2. More generally,
one might assume that although $\theta^*$ is not exactly sparse, it can be well-approximated by a sparse
vector, in which case one might say that $\theta^*$ is "weakly sparse", "sparsifiable" or "compressible".
Section 4.3 is devoted to a more detailed discussion of this weakly sparse case.

A natural M-estimator for this problem is the Lasso 2(3, 68], obtained by solving the $\ell_1$-penalized
quadratic program

$$\hat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1 \Big\}, \tag{27}$$

for some choice $\lambda_n > 0$ of regularization parameter. Note that this Lasso estimator is a
particular case of the general M-estimator (1), based on the loss function and regularization pair
$\mathcal{L}(\theta; Z_1^n) = \frac{1}{2n} \|y - X\theta\|_2^2$ and $\mathcal{R}(\theta) = \sum_{j=1}^p |\theta_j| = \|\theta\|_1$. We now show how Theorem 1 can be
specialized to obtain bounds on the error $\hat{\theta}_{\lambda_n} - \theta^*$ for the Lasso estimate.
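
For readers who want to experiment, the Lasso program (27) can be solved approximately by proximal gradient descent (ISTA). The sketch below is one standard way to do so and is not the method analyzed in the paper; the function names, the conservative step size based on the squared Frobenius norm of $X$, the iteration count, and the toy data are all assumptions made for illustration.

```python
import math


def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_ista(X, y, lam, n_iter=500):
    """Approximately minimize (1/(2n)) * ||y - X theta||_2^2 + lam * ||theta||_1."""
    n, p = len(X), len(X[0])
    # ||X||_F^2 / n upper-bounds the largest eigenvalue of X^T X / n,
    # so 1 / (that bound) is a safe (if conservative) step size.
    step = n / sum(x * x for row in X for x in row)
    theta = [0.0] * p
    for _ in range(n_iter):
        residual = [
            sum(row[j] * theta[j] for j in range(p)) - yi for row, yi in zip(X, y)
        ]
        gradient = [sum(X[i][j] * residual[i] for i in range(n)) / n for j in range(p)]
        theta = [
            soft_threshold(theta[j] - step * gradient[j], step * lam) for j in range(p)
        ]
    return theta


# Toy check on an orthogonal design X = sqrt(n) * I, where the Lasso solution
# is known in closed form: soft-thresholding of y / sqrt(n) at level lam.
n = 3
root_n = math.sqrt(n)
X = [[root_n if i == j else 0.0 for j in range(n)] for i in range(n)]
y = [2.0 * root_n, 0.05 * root_n, -1.0 * root_n]
theta_hat = lasso_ista(X, y, lam=0.5)
```

On this toy design the iterates converge to approximately `[1.5, 0.0, -0.5]`, the soft-thresholded version of `y / sqrt(n)`.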



4.1 Restricted eigenvalues for sparse linear regression 

For the least-squares loss function that underlies the Lasso, the first-order Taylor series expansion
from Definition 2 is exact, so that

$$\delta\mathcal{L}(\Delta, \theta^*) = \Big\langle \Delta, \frac{1}{n} X^T X \Delta \Big\rangle = \frac{1}{n} \|X\Delta\|_2^2.$$

Thus, in this special case, the Taylor series error is independent of $\theta^*$, a fact which allows for
substantial theoretical simplification. More precisely, in order to establish restricted strong convexity, it
suffices to establish a lower bound on $\|X\Delta\|_2^2 / n$ that holds uniformly for an appropriately restricted
subset of $p$-dimensional vectors $\Delta$.

As previously discussed in Example 1, for any subset $S \subseteq \{1, 2, \ldots, p\}$, the $\ell_1$-norm is
decomposable with respect to the subspace $\mathcal{M}(S) = \{\theta \in \mathbb{R}^p \mid \theta_{S^c} = 0\}$ and its orthogonal complement.
When the unknown regression vector $\theta^* \in \mathbb{R}^p$ is exactly sparse, it is natural to choose $S$ equal to
the support set of $\theta^*$. By appropriately specializing the definition (17) of $\mathbb{C}$, we are led to consider
the cone

$$\mathbb{C}(S) := \big\{ \Delta \in \mathbb{R}^p \mid \|\Delta_{S^c}\|_1 \leq 3 \|\Delta_S\|_1 \big\}. \tag{28}$$

See Figure 1(a) for an illustration of this set in three dimensions. With this choice, restricted strong
convexity with respect to the $\ell_2$-norm is equivalent to requiring that the design matrix $X$ satisfy
the condition

$$\frac{\|X\theta\|_2^2}{n} \geq \kappa_{\mathcal{L}} \|\theta\|_2^2 \quad \text{for all } \theta \in \mathbb{C}(S). \tag{29}$$
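
Membership in the cone (28) is easy to test for a given vector. The following sketch is an illustration only; the function name `in_cone`, the zero-based index set, and the example vectors are assumptions introduced here.

```python
def in_cone(delta, support):
    """Check the cone condition ||delta_{S^c}||_1 <= 3 * ||delta_S||_1.

    `support` is a set of zero-based indices playing the role of S.
    """
    on_support = sum(abs(delta[j]) for j in support)
    off_support = sum(abs(d) for j, d in enumerate(delta) if j not in support)
    return off_support <= 3.0 * on_support


# Example vectors in three dimensions with S = {0} (illustrative values).
mostly_on_support = in_cone([1.0, 0.5, 0.5], {0})   # off-support mass 1.0 <= 3.0
mostly_off_support = in_cone([0.1, 1.0, 1.0], {0})  # off-support mass 2.0 > 0.3
```

Vectors whose mass sits mostly on the support satisfy the condition, while vectors dominated by off-support coordinates do not, and the restricted eigenvalue condition (29) is only required for the former.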



This lower bound is a type of restricted eigenvalue (RE) condition, and has been studied in past
work on basis pursuit and the Lasso (e.g., 0, 0, 57, [13]). One could also require that a related
condition hold with respect to the $\ell_1$-norm, viz.

$$\frac{\|X\theta\|_2^2}{n} \geq \kappa'_{\mathcal{L}} \frac{\|\theta\|_1^2}{|S|} \quad \text{for all } \theta \in \mathbb{C}(S). \tag{30}$$

This type of $\ell_1$-based RE condition is less restrictive than the corresponding $\ell_2$-version (29). We
refer the reader to the paper by van de Geer and Bühlmann [72] for an extensive discussion of
different types of restricted eigenvalue or compatibility conditions.

It is natural to ask whether there are many matrices that satisfy these types of RE conditions.
If $X$ has i.i.d. entries following a sub-Gaussian distribution (including Gaussian and Bernoulli
variables as special cases), then known results in random matrix theory imply that the restricted
isometry property [16] holds with high probability, which in turn implies that the RE condition
holds [9, 72].

Since statistical applications involve design matrices with substantial dependency, it is
natural to ask whether an RE condition also holds for more general random designs. This question
was addressed by Raskutti et al. [57, 56], who showed that if the design matrix $X \in \mathbb{R}^{n \times p}$ is formed
by independently sampling each row $X_i \sim N(0, \Sigma)$, referred to as the $\Sigma$-Gaussian ensemble, then
there are strictly positive constants $(\kappa_1, \kappa_2)$, depending only on the positive definite matrix $\Sigma$, such
that

$$\frac{\|X\theta\|_2^2}{n} \geq \kappa_1 \|\theta\|_2^2 - \kappa_2 \frac{\log p}{n} \|\theta\|_1^2 \quad \text{for all } \theta \in \mathbb{R}^p \tag{31}$$

with probability greater than $1 - c_1 \exp(-c_2 n)$. The bound (31) has an important consequence: it
guarantees that the RE property (29) holds (see footnote 4) with $\kappa_{\mathcal{L}} = \frac{\kappa_1}{2} > 0$ as long as $n > 64 (\kappa_2 / \kappa_1) \, s \log p$.
Therefore, not only do there exist matrices satisfying the RE property (29), but any matrix sampled
from a $\Sigma$-Gaussian ensemble will satisfy it with high probability. Related analysis by Rudelson and
Zhou [36] extends these types of guarantees to the case of sub-Gaussian designs, also allowing for
substantial dependencies among the covariates.

4.2 Lasso estimates with exact sparsity 

We now show how Corollary 1 can be used to derive convergence rates for the error of the Lasso
estimate when the unknown regression vector $\theta^*$ is $s$-sparse. In order to state these results, we
require some additional notation. Using $X_j \in \mathbb{R}^n$ to denote the $j$th column of $X$, we say that $X$ is
column-normalized if

$$\frac{\|X_j\|_2}{\sqrt{n}} \leq 1 \quad \text{for all } j = 1, 2, \ldots, p. \tag{32}$$

Here we have set the upper bound to one in order to simplify notation. This particular choice
entails no loss of generality, since we can always rescale the linear model appropriately (including
the observation noise variance) so that it holds.

In addition, we assume that the noise vector $w \in \mathbb{R}^n$ is zero-mean and has sub-Gaussian tails,
meaning that there is a constant $\sigma > 0$ such that for any fixed $\|v\|_2 = 1$,

$$\mathbb{P}\big[ |\langle v, w \rangle| \geq t \big] \leq 2 \exp\Big( -\frac{t^2}{2\sigma^2} \Big) \quad \text{for all } t > 0. \tag{33}$$

For instance, this condition holds when the noise vector $w$ has i.i.d. $N(0, 1)$ entries, or consists
of independent bounded random variables. Under these conditions, we recover as a corollary of
Theorem 1 the following result:



4 To see this fact, note that for any $\theta \in \mathbb{C}(S)$, we have $\|\theta\|_1 \leq 4 \|\theta_S\|_1 \leq 4 \sqrt{s} \|\theta_S\|_2$, so that
$\|\theta\|_1^2 \leq 16 s \|\theta\|_2^2$. Given the lower bound (31), for any $\theta \in \mathbb{C}(S)$, we have the lower bound
$\frac{\|X\theta\|_2^2}{n} \geq \big\{ \kappa_1 - 16 \kappa_2 \frac{s \log p}{n} \big\} \|\theta\|_2^2 \geq \frac{\kappa_1}{2} \|\theta\|_2^2$, where the final inequality follows
as long as $n > 64 (\kappa_2 / \kappa_1) \, s \log p$.

Corollary 2. Consider an $s$-sparse instance of the linear regression model (26) such that $X$ satisfies
the RE condition (29) and the column normalization condition (32). Given the Lasso program (27)
with regularization parameter $\lambda_n = 4\sigma \sqrt{\frac{\log p}{n}}$, then with probability at least $1 - c_1 \exp(-c_2 n \lambda_n^2)$, any
optimal solution $\hat{\theta}_{\lambda_n}$ satisfies the bounds

$$\|\hat{\theta}_{\lambda_n} - \theta^*\|_2^2 \leq \frac{64 \sigma^2}{\kappa_{\mathcal{L}}^2} \frac{s \log p}{n}, \quad \text{and} \quad \|\hat{\theta}_{\lambda_n} - \theta^*\|_1 \leq \frac{24 \sigma}{\kappa_{\mathcal{L}}} \, s \sqrt{\frac{\log p}{n}}. \tag{34}$$

Although error bounds of this form are known from past work (e.g., [jj, 0, [3]), our proof
illuminates the underlying structure that leads to the different terms in the bound; in particular, see
equations (25a) and (25b) in the statement of Corollary 1.

Proof. We first note that the RE condition (29) implies that RSC holds with respect to the subspace
$\mathcal{M}(S)$. As discussed in Example 1, the $\ell_1$-norm is decomposable with respect to $\mathcal{M}(S)$ and its
orthogonal complement, so that we may set $\bar{\mathcal{M}}(S) = \mathcal{M}(S)$. Since any vector $\theta \in \mathcal{M}(S)$ has at most $s$
non-zero entries, the subspace compatibility constant is given by

$$\Psi(\mathcal{M}(S)) = \sup_{\theta \in \mathcal{M}(S) \setminus \{0\}} \frac{\|\theta\|_1}{\|\theta\|_2} = \sqrt{s}.$$

The final step is to compute an appropriate choice of the regularization parameter. The gradient
of the quadratic loss at $\theta^*$ is given by $\nabla \mathcal{L}(\theta^*; (y, X)) = -\frac{1}{n} X^T w$, whereas the dual norm of the $\ell_1$-norm is
the $\ell_\infty$-norm. Consequently, we need to specify a choice of $\lambda_n > 0$ such that

$$\lambda_n \geq 2 \mathcal{R}^*(\nabla \mathcal{L}(\theta^*)) = 2 \Big\| \frac{1}{n} X^T w \Big\|_\infty$$

with high probability. Using the column normalization (32) and sub-Gaussian (33) conditions, for
each $j = 1, \ldots, p$, we have the tail bound $\mathbb{P}\big[ |\langle X_j, w \rangle / n| \geq t \big] \leq 2 \exp\big( -\frac{n t^2}{2 \sigma^2} \big)$. Consequently, by the
union bound, we conclude that $\mathbb{P}\big[ \|X^T w / n\|_\infty \geq t \big] \leq 2 \exp\big( -\frac{n t^2}{2 \sigma^2} + \log p \big)$. Setting $t^2 = \frac{4 \sigma^2 \log p}{n}$, we
see that the choice of $\lambda_n$ given in the statement is valid with probability at least $1 - c_1 \exp(-c_2 n \lambda_n^2)$.
Consequently, the claims (34) follow from the bounds (25a) and (25b) in Corollary 1. □
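
The choice of regularization parameter in this proof can be checked numerically. The sketch below assumes the tail bound has the form $2 \exp(-n t^2 / (2\sigma^2) + \log p)$, as in the union-bound step above; the function names and the values of $\sigma$, $n$ and $p$ are illustrative assumptions.

```python
import math


def lasso_lambda(sigma, n, p):
    """Regularization parameter lambda_n = 4 * sigma * sqrt(log(p) / n)."""
    return 4.0 * sigma * math.sqrt(math.log(p) / n)


def union_bound_failure(sigma, n, p, t):
    """Assumed union bound on P[ ||X^T w / n||_inf >= t ]."""
    return 2.0 * math.exp(-n * t * t / (2.0 * sigma ** 2) + math.log(p))


# Illustrative values (assumed): sigma = 1, n = 200 samples, p = 1000 covariates.
sigma, n_samples, n_covariates = 1.0, 200, 1000
lam = lasso_lambda(sigma, n_samples, n_covariates)

# The condition lambda_n >= 2 * ||X^T w / n||_inf fails only if the sup-norm
# exceeds t = lambda_n / 2; under the assumed bound this has probability <= 2 / p.
failure_probability = union_bound_failure(sigma, n_samples, n_covariates, lam / 2.0)
```

With $t = \lambda_n / 2 = 2\sigma \sqrt{\log p / n}$, the exponent reduces to $-\log p$, so the failure probability is at most $2/p$.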

4.3 Lasso estimates with weakly sparse models 

We now consider regression models for which $\theta^*$ is not exactly sparse, but rather can be approximated
well by a sparse vector. One way in which to formalize this notion is by considering the $\ell_q$ "ball" of
radius $R_q$, given by

$$\mathbb{B}_q(R_q) := \Big\{ \theta \in \mathbb{R}^p \;\Big|\; \sum_{i=1}^p |\theta_i|^q \leq R_q \Big\}, \quad \text{where } q \in [0, 1] \text{ is fixed.}$$

In the special case $q = 0$, this set corresponds to an exact sparsity constraint; that is, $\theta^* \in \mathbb{B}_0(R_0)$
if and only if $\theta^*$ has at most $R_0$ non-zero entries. More generally, for $q \in (0, 1]$, the set $\mathbb{B}_q(R_q)$
enforces a certain decay rate on the ordered absolute values of $\theta^*$.

In the case of weakly sparse vectors, the constraint set $\mathbb{C}$ takes the form

$$\mathbb{C}(\mathcal{M}, \bar{\mathcal{M}}^\perp; \theta^*) = \big\{ \Delta \in \mathbb{R}^p \mid \|\Delta_{S^c}\|_1 \leq 3 \|\Delta_S\|_1 + 4 \|\theta^*_{S^c}\|_1 \big\}. \tag{35}$$

In contrast to the case of exact sparsity, the set $\mathbb{C}$ is no longer a cone, but rather contains a ball
centered at the origin; compare panels (a) and (b) of Figure 1. As a consequence, it is never
possible to ensure that $\|X\theta\|_2 / \sqrt{n}$ is uniformly bounded from below for all vectors $\theta$ in the set (35),
and so a strictly positive tolerance term $\tau_{\mathcal{L}}(\theta^*) > 0$ is required. The random matrix result (31),
stated in the previous section, allows us to establish a form of RSC that is appropriate for the setting
of $\ell_q$-ball sparsity. We summarize our conclusions in the following:



Corollary 3. Suppose that $X$ satisfies the condition (31) as well as the column normalization
condition (32), the noise $w$ is sub-Gaussian (33), and $\theta^*$ belongs to $\mathbb{B}_q(R_q)$ for a radius $R_q$ such
that $\sqrt{R_q} \big( \frac{\log p}{n} \big)^{\frac{1}{2} - \frac{q}{4}} \leq 1$. Then if we solve the Lasso with regularization parameter $\lambda_n = 4\sigma \sqrt{\frac{\log p}{n}}$,
there are universal positive constants $(c_0, c_1, c_2)$ such that any optimal solution $\hat{\theta}_{\lambda_n}$ satisfies

$$\|\hat{\theta}_{\lambda_n} - \theta^*\|_2^2 \leq c_0 R_q \Big( \frac{\sigma^2}{\kappa_1^2} \frac{\log p}{n} \Big)^{1 - \frac{q}{2}} \tag{36}$$

with probability at least $1 - c_1 \exp(-c_2 n \lambda_n^2)$.

Remarks: Note that this corollary is a strict generalization of Corollary 2, to which it reduces
when $q = 0$. More generally, the parameter $q \in [0, 1]$ controls the relative "sparsifiability" of $\theta^*$, with
larger values corresponding to lesser sparsity. Naturally then, the rate slows down as $q$ increases
from 0 towards 1. In fact, Raskutti et al. [57] show that the rates (36) are minimax-optimal over
the $\ell_q$-balls, implying that not only are the consequences of Theorem 1 sharp for the Lasso, but
more generally, no algorithm can achieve faster rates.

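Proof. Since the loss function $\mathcal{L}$ is quadratic, the proof of Corollary 2 shows that the stated choice
$\lambda_n = 4\sigma \sqrt{\frac{\log p}{n}}$ is valid with probability at least $1 - c \exp(-c' n \lambda_n^2)$. Let us now show that the RSC
condition holds. We do so via condition (31) applied to equation (35). For a threshold $\eta > 0$ to be
chosen, define the thresholded subset

$$S_\eta := \big\{ j \in \{1, 2, \ldots, p\} \mid |\theta^*_j| > \eta \big\}. \tag{37}$$

Now recall the subspaces $\mathcal{M}(S_\eta)$ and $\mathcal{M}^\perp(S_\eta)$ previously defined in Example 1, where we set
$S = S_\eta$. The following lemma, proved in Appendix B, provides sufficient
conditions for restricted strong convexity with respect to these subspace pairs:

Lemma 2. Suppose that the conditions of Corollary 3 hold, and $n > 9 \kappa_2 |S_\eta| \log p$. Then with
the choice $\eta = \frac{\lambda_n}{\kappa_1}$, the RSC condition holds over $\mathbb{C}(\mathcal{M}(S_\eta), \mathcal{M}^\perp(S_\eta), \theta^*)$ with $\kappa_{\mathcal{L}} = \kappa_1 / 4$ and
$\tau_{\mathcal{L}}^2 = 8 \kappa_2 \frac{\log p}{n} \|\theta^*_{S_\eta^c}\|_1^2$.

Consequently, we may apply Theorem 1 with $\kappa_{\mathcal{L}} = \kappa_1 / 4$ and $\tau_{\mathcal{L}}^2(\theta^*) = 8 \kappa_2 \frac{\log p}{n} \|\theta^*_{S_\eta^c}\|_1^2$ to conclude
that

$$\|\hat{\theta}_{\lambda_n} - \theta^*\|_2^2 \leq 144 \frac{\lambda_n^2}{\kappa_1^2} |S_\eta| + \frac{4 \lambda_n}{\kappa_1} \Big\{ 16 \kappa_2 \frac{\log p}{n} \|\theta^*_{S_\eta^c}\|_1^2 + 4 \|\theta^*_{S_\eta^c}\|_1 \Big\}, \tag{38}$$

where we have used the fact that $\Psi^2(S_\eta) = |S_\eta|$, as noted in the proof of Corollary 2.

It remains to upper bound the cardinality of $S_\eta$ in terms of the threshold $\eta$ and $\ell_q$-ball radius
$R_q$. Note that we have

$$R_q \geq \sum_{j=1}^p |\theta^*_j|^q \geq \sum_{j \in S_\eta} |\theta^*_j|^q \geq \eta^q |S_\eta|, \tag{39}$$

whence $|S_\eta| \leq \eta^{-q} R_q$ for any $\eta > 0$. Next we upper bound the approximation error $\|\theta^*_{S_\eta^c}\|_1$, using
the fact that $\theta^* \in \mathbb{B}_q(R_q)$. Letting $S_\eta^c$ denote the complementary set $\{1, 2, \ldots, p\} \setminus S_\eta$, we have

$$\|\theta^*_{S_\eta^c}\|_1 = \sum_{j \in S_\eta^c} |\theta^*_j| = \sum_{j \in S_\eta^c} |\theta^*_j|^q \, |\theta^*_j|^{1 - q} \leq R_q \, \eta^{1 - q}. \tag{40}$$

Setting $\eta = \lambda_n / \kappa_1$ and then substituting the bounds (39) and (40) into the bound (38) yields

$$\|\hat{\theta}_{\lambda_n} - \theta^*\|_2^2 \leq 160 \Big( \frac{\lambda_n^2}{\kappa_1^2} \Big)^{1 - \frac{q}{2}} R_q + 64 \kappa_2 \Big\{ \Big( \frac{\lambda_n^2}{\kappa_1^2} \Big)^{1 - \frac{q}{2}} R_q \Big\}^2 \frac{(\log p)/n}{\lambda_n / \kappa_1}.$$

For any fixed noise variance, our choice of regularization parameter ensures that the ratio $\frac{\log p}{n \lambda_n^2}$
is of order one, so that the claim follows. □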
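
The two counting bounds used in this proof, $|S_\eta| \leq \eta^{-q} R_q$ from (39) and $\|\theta^*_{S_\eta^c}\|_1 \leq R_q \eta^{1-q}$ from (40), can be verified on a concrete weakly sparse vector. In the sketch below, the polynomially decaying vector `theta_star`, the exponent `q`, and the threshold `eta` are hypothetical choices made for illustration, and $R_q$ is taken to be the smallest radius containing that vector.

```python
def thresholded_support(theta_star, eta):
    """Indices j with |theta*_j| > eta, i.e. the set S_eta of (37)."""
    return [j for j, value in enumerate(theta_star) if abs(value) > eta]


def lq_radius(theta, q):
    """Smallest R_q with sum_j |theta_j|^q <= R_q (for q in (0, 1])."""
    return sum(abs(value) ** q for value in theta)


# Hypothetical weakly sparse vector with polynomially decaying entries.
theta_star = [1.0 / (j + 1) ** 2 for j in range(50)]
q, eta = 0.5, 0.1

support = thresholded_support(theta_star, eta)
radius = lq_radius(theta_star, q)
tail_l1 = sum(abs(v) for j, v in enumerate(theta_star) if j not in set(support))

cardinality_bound = eta ** (-q) * radius   # right-hand side of |S_eta| <= eta^{-q} R_q
tail_bound = radius * eta ** (1.0 - q)     # right-hand side of the bound (40)
```

Both inequalities hold with room to spare here; how tight they are depends on how the entries of $\theta^*$ are distributed around the threshold.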



4.4 Extensions to generalized linear models 

In this section, we briefly outline extensions of the preceding results to the family of generalized
linear models (GLM). Suppose that conditioned on a vector $x \in \mathbb{R}^p$ of covariates, a response variable
$y \in \mathcal{Y}$ has the distribution

$$\mathbb{P}_{\theta^*}(y \mid x) \propto \exp\Big\{ \frac{y \langle \theta^*, x \rangle - \Phi(\langle \theta^*, x \rangle)}{c(\sigma)} \Big\}. \tag{41}$$

Here the quantity $c(\sigma)$ is a fixed and known scale parameter, and the function $\Phi : \mathbb{R} \to \mathbb{R}$ is the link
function, also known. The family (41) includes many well-known classes of regression models as
special cases, including ordinary linear regression (obtained with $\mathcal{Y} = \mathbb{R}$, $\Phi(t) = t^2/2$ and $c(\sigma) = \sigma^2$),
and logistic regression (obtained with $\mathcal{Y} = \{0, 1\}$, $c(\sigma) = 1$ and $\Phi(t) = \log(1 + \exp(t))$).
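
The two special cases differ only in the function $\Phi$. The sketch below writes out both choices together with the averaged negative log-likelihood $\frac{1}{n} \sum_i \{ -y_i \langle \theta, x_i \rangle + \Phi(\langle \theta, x_i \rangle) \}$, with the scale parameter dropped. The function names and the toy data are assumptions for illustration, and the logistic $\Phi$ is written in a numerically stable form that is algebraically equal to $\log(1 + \exp(t))$.

```python
import math


def phi_linear(t):
    """Phi(t) = t^2 / 2, corresponding to ordinary linear regression."""
    return t * t / 2.0


def phi_logistic(t):
    """Phi(t) = log(1 + exp(t)), evaluated without overflow for large |t|."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def glm_negative_log_likelihood(theta, X, y, phi):
    """Average of -y_i * <theta, x_i> + Phi(<theta, x_i>) over the samples."""
    total = 0.0
    for row, yi in zip(X, y):
        inner = sum(xj * tj for xj, tj in zip(row, theta))
        total += -yi * inner + phi(inner)
    return total / len(X)


# Toy data (assumed): two samples with binary responses.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 0.0]
loss_at_zero = glm_negative_log_likelihood([0.0, 0.0], X, y, phi_logistic)
```

At $\theta = 0$ the logistic loss equals $\log 2$, since every sample contributes $\Phi(0)$.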

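Given samples $Z_i = (x_i, y_i) \in \mathbb{R}^p \times \mathcal{Y}$, the goal is to estimate the unknown vector $\theta^* \in \mathbb{R}^p$.
Under a sparsity assumption on $\theta^*$, a natural estimator is based on minimizing the (negative) log
likelihood, combined with an $\ell_1$-regularization term. This combination leads to the convex program

$$\hat{\theta}_{\lambda_n} \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \underbrace{\frac{1}{n} \sum_{i=1}^n \big\{ -y_i \langle \theta, x_i \rangle + \Phi(\langle \theta, x_i \rangle) \big\}}_{\mathcal{L}(\theta; Z_1^n)} + \lambda_n \|\theta\|_1 \Big\}. \tag{42}$$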

In order to extend the error bounds from the previous section, a key ingredient is to establish
that this GLM-based loss function satisfies a form of restricted strong convexity. Along these lines,
Negahban et al. [50] proved the following result: suppose that the covariate vectors $x_i$ are zero-mean
with covariance matrix $\Sigma \succ 0$, and are drawn i.i.d. from a distribution with sub-Gaussian tails (see
equation (33)). Then there are constants $\kappa_1, \kappa_2$ such that the first-order Taylor series error for the
GLM-based loss (42) satisfies the lower bound

$$\delta\mathcal{L}(\Delta, \theta^*) \geq \kappa_1 \|\Delta\|_2^2 - \kappa_2 \frac{\log p}{n} \|\Delta\|_1^2 \quad \text{for all } \|\Delta\|_2 \leq 1. \tag{43}$$

As discussed following Definition 2, this type of lower bound implies that $\mathcal{L}$ satisfies a form of RSC,
as long as the sample size scales as $n = \Omega(s \log p)$, where $s$ is the target sparsity. Consequently, this
lower bound (43) allows us to recover analogous bounds on the error $\|\hat{\theta}_{\lambda_n} - \theta^*\|_2$ of the GLM-based
estimator (42).



5 Convergence rates for group-structured norms 

The preceding two sections addressed M-estimators based on $\ell_1$-regularization, the simplest type
of decomposable regularizer. We now turn to some extensions of our results to more complex
regularizers that are also decomposable. Various researchers have proposed extensions of the Lasso
based on regularizers that have more structure than the $\ell_1$ norm (e.g., 0,0 EBB). Such
regularizers allow one to impose different types of block-sparsity constraints, in which groups of
parameters are assumed to be active (or inactive) simultaneously. These norms arise in the context
of multivariate regression, where the goal is to predict a multivariate output in $\mathbb{R}^m$ on the basis
of a set of $p$ covariates. Here it is appropriate to assume that groups of covariates are useful for
predicting the different elements of the $m$-dimensional output vector. We refer the reader to the
papers 0, 0, M, 0, 3 for further discussion of and motivation for the use of block-structured
norms.

Given a collection $\mathcal{G} = \{G_1, \ldots, G_{N_{\mathcal{G}}}\}$ of groups, recall from Example 2 in Section 2.2 the
definition of the group norm $\|\cdot\|_{\mathcal{G}, \vec{\alpha}}$. In full generality, this group norm is based on a weight vector
$\vec{\alpha} = (\alpha_1, \ldots, \alpha_{N_{\mathcal{G}}}) \in [2, \infty]^{N_{\mathcal{G}}}$, one for each group. For simplicity, here we consider the case when
$\alpha_t = \alpha$ for all $t = 1, 2, \ldots, N_{\mathcal{G}}$, and we use $\|\cdot\|_{\mathcal{G}, \alpha}$ to denote the associated group norm. As a natural
extension of the Lasso, we consider the block Lasso estimator

$$\hat{\theta} \in \arg\min_{\theta \in \mathbb{R}^p} \Big\{ \frac{1}{n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_{\mathcal{G}, \alpha} \Big\}, \tag{44}$$

where $\lambda_n > 0$ is a user-defined regularization parameter. Different choices of the parameter $\alpha$ yield
different estimators, and in this section, we consider the range $\alpha \in [2, \infty]$. This range covers the two
most commonly applied choices, $\alpha = 2$, often referred to as the group Lasso, as well as the choice
$\alpha = +\infty$.
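
For concreteness, the group norm with a common exponent $\alpha$ is the sum over groups of the $\ell_\alpha$-norms of the corresponding sub-vectors. The sketch below assumes this definition (from Example 2, which is not reproduced in this section); the function name, the representation of groups as lists of zero-based indices, and the example vector are illustrative assumptions.

```python
import math


def group_norm(theta, groups, alpha):
    """Sum over groups of the l_alpha norm of the sub-vector on each group.

    `groups` is a list of index lists; `alpha` may be math.inf.
    """
    total = 0.0
    for group in groups:
        block = [abs(theta[j]) for j in group]
        if alpha == math.inf:
            total += max(block)
        else:
            total += sum(b ** alpha for b in block) ** (1.0 / alpha)
    return total


# Illustrative vector with three groups of size two.
theta = [3.0, 4.0, 0.0, 0.0, 1.0, -1.0]
groups = [[0, 1], [2, 3], [4, 5]]

norm_alpha_2 = group_norm(theta, groups, 2)           # 5 + 0 + sqrt(2)
norm_alpha_inf = group_norm(theta, groups, math.inf)  # 4 + 0 + 1

# With singleton groups the group norm reduces to the ordinary l1-norm.
singletons = [[j] for j in range(len(theta))]
norm_singletons = group_norm(theta, singletons, 2)
```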



5.1 Restricted strong convexity for group sparsity 

As a parallel to our analysis of ordinary sparse regression, our first step is to provide a condition
sufficient to guarantee restricted strong convexity for the group-sparse setting. More specifically, we
state the natural extension of condition (31) to the block-sparse setting, and prove that it holds with
high probability for the class of $\Sigma$-Gaussian random designs. Recall from Theorem 1 that the dual
norm of the regularizer plays a central role. As discussed previously, for the block-$(1, \alpha)$-regularizer,
the associated dual norm is a block-$(\infty, \alpha^*)$ norm, where $(\alpha, \alpha^*)$ are conjugate exponents satisfying
$\frac{1}{\alpha} + \frac{1}{\alpha^*} = 1$.

Letting $\varepsilon \sim N(0, I_{p \times p})$ be a standard normal vector, we consider the following condition. Suppose
that there are strictly positive constants $(\kappa_1, \kappa_2)$ such that, for all $\Delta \in \mathbb{R}^p$, we have

$$\frac{\|X\Delta\|_2^2}{n} \geq \kappa_1 \|\Delta\|_2^2 - \kappa_2 \, \rho_{\mathcal{G}}^2(\alpha^*) \, \|\Delta\|_{\mathcal{G}, \alpha}^2, \quad \text{where } \rho_{\mathcal{G}}(\alpha^*) := \mathbb{E}\Big[ \max_{t = 1, 2, \ldots, N_{\mathcal{G}}} \frac{\|\varepsilon_{G_t}\|_{\alpha^*}}{\sqrt{n}} \Big]. \tag{45}$$

To understand this condition, first consider the special case of $N_{\mathcal{G}} = p$ groups, each of size one, so
that the group-sparse norm reduces to the ordinary $\ell_1$-norm, and its dual is the $\ell_\infty$-norm. Using
$\alpha = 2$ for concreteness, we have $\rho_{\mathcal{G}}(2) = \mathbb{E}[\|\varepsilon\|_\infty] / \sqrt{n} \leq \sqrt{\frac{3 \log p}{n}}$, using standard bounds on
Gaussian maxima. Therefore, condition (45) reduces to the earlier condition (31) in this special
case.

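Let us consider a more general setting, say with $\alpha = 2$ and $N_{\mathcal{G}}$ groups each of size $m$, so that
$p = N_{\mathcal{G}} m$. For this choice of groups and norm, we have $\rho_{\mathcal{G}}(2) = \mathbb{E}\big[ \max_{t = 1, \ldots, N_{\mathcal{G}}} \frac{\|\varepsilon_{G_t}\|_2}{\sqrt{n}} \big]$, where each
sub-vector $\varepsilon_{G_t}$ is a standard Gaussian vector with $m$ elements. Since $\mathbb{E}[\|\varepsilon_{G_t}\|_2] \leq \sqrt{m}$, tail bounds
for $\chi^2$-variates yield $\rho_{\mathcal{G}}(2) \leq \sqrt{\frac{m}{n}} + \sqrt{\frac{3 \log N_{\mathcal{G}}}{n}}$, so that the condition (45) implies the bound

$$\frac{\|X\Delta\|_2^2}{n} \geq \kappa_1 \|\Delta\|_2^2 - \kappa_2 \Big[ \sqrt{\frac{m}{n}} + \sqrt{\frac{3 \log N_{\mathcal{G}}}{n}} \Big]^2 \|\Delta\|_{\mathcal{G}, 2}^2 \quad \text{for all } \Delta \in \mathbb{R}^p. \tag{46}$$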

Thus far, we have seen the form that condition (45) takes for different choices of the groups and
parameter $\alpha$. It is natural to ask whether there are any matrices that satisfy the condition (45).
As shown in the following result, the answer is affirmative; more strongly, a matrix
sampled from the $\Sigma$-Gaussian ensemble will satisfy this condition with high probability. (Here we
recall that for a non-degenerate covariance matrix, a random design matrix $X \in \mathbb{R}^{n \times p}$ is drawn
from the $\Sigma$-Gaussian ensemble if each row $x_i \sim N(0, \Sigma)$, i.i.d. for $i = 1, 2, \ldots, n$.)

Proposition 1. For a design matrix $X \in \mathbb{R}^{n \times p}$ from the $\Sigma$-ensemble, there are constants $(\kappa_1, \kappa_2)$
depending only on $\Sigma$ such that condition (45) holds with probability greater than $1 - c_1 \exp(-c_2 n)$.

We provide the proof of this result in Appendix C.1. This condition can be used to show that
appropriate forms of RSC hold, for both the cases of exactly group-sparse and weakly sparse vectors. As
with $\ell_1$-regularization, these RSC conditions are milder than analogous group-based RIP conditions
(e.g., [28, 67]), which require that all sub-matrices up to a certain size are close to isometries.

5.2 Convergence rates 

Apart from RSC, we impose one additional condition on the design matrix. For a given group $G$ of
size $m$, let us view the matrix $X_G \in \mathbb{R}^{n \times m}$ as an operator from $\ell_\alpha^m \to \ell_2^n$, and define the associated
operator norm $|||X_G|||_{\alpha \to 2} := \max_{\|\theta\|_\alpha = 1} \|X_G \theta\|_2$. We then require that

$$\frac{|||X_{G_t}|||_{\alpha \to 2}}{\sqrt{n}} \leq 1 \quad \text{for all } t = 1, 2, \ldots, N_{\mathcal{G}}. \tag{47}$$

Note that this is a natural generalization of the column normalization condition (32), to which it
reduces when we have $N_{\mathcal{G}} = p$ groups, each of size one. As before, we may assume without loss
of generality, rescaling $X$ and the noise as necessary, that condition (47) holds with constant one.
Finally, we define the maximum group size $m = \max_{t = 1, \ldots, N_{\mathcal{G}}} |G_t|$. With this notation, we have the
following novel result:

Corollary 4. Suppose that the noise $w$ is sub-Gaussian (33), and the design matrix $X$ satisfies
condition (45) and the block normalization condition (47). If we solve the group Lasso with

$$\lambda_n \geq 2\sigma \Big\{ \frac{m^{1 - 1/\alpha}}{\sqrt{n}} + \sqrt{\frac{\log N_{\mathcal{G}}}{n}} \Big\}, \tag{48}$$

then with probability at least $1 - 2/N_{\mathcal{G}}^2$, for any group subset $S_{\mathcal{G}} \subseteq \{1, 2, \ldots, N_{\mathcal{G}}\}$ with cardinality
$|S_{\mathcal{G}}| = s_{\mathcal{G}}$, any optimal solution $\hat{\theta}_{\lambda_n}$ satisfies

$$\|\hat{\theta}_{\lambda_n} - \theta^*\|_2^2 \leq \frac{4 \lambda_n^2}{\kappa_{\mathcal{L}}^2} s_{\mathcal{G}} + \frac{4 \lambda_n}{\kappa_{\mathcal{L}}} \sum_{t \notin S_{\mathcal{G}}} \|\theta^*_{G_t}\|_\alpha. \tag{49}$$

Remarks: Since the result applies to any $\alpha \in [2, \infty]$, we can observe how the choices of different
group-sparse norms affect the convergence rates. So as to simplify this discussion, let us assume
that the groups are all of equal size $m$, so that $p = m N_{\mathcal{G}}$ is the ambient dimension of the problem.

Case $\alpha = 2$: The case $\alpha = 2$ corresponds to the block $(1, 2)$ norm, and the resulting estimator is
frequently referred to as the group Lasso. For this case, we can set the regularization parameter as
$\lambda_n = 2\sigma \big\{ \sqrt{\frac{m}{n}} + \sqrt{\frac{\log N_{\mathcal{G}}}{n}} \big\}$. If we assume moreover that $\theta^*$ is exactly group-sparse, say supported on
a group subset $S_{\mathcal{G}} \subseteq \{1, 2, \ldots, N_{\mathcal{G}}\}$ of cardinality $s_{\mathcal{G}}$, then the bound (49) takes the form

$$\|\hat{\theta} - \theta^*\|_2^2 \lesssim \frac{s_{\mathcal{G}} \, m}{n} + \frac{s_{\mathcal{G}} \log N_{\mathcal{G}}}{n}. \tag{50}$$

Similar bounds were derived in independent work by Lounici et al. [41] and Huang and Zhang [28]
for this special case of exact block sparsity. The analysis here shows how the different terms arise,
in particular via the noise magnitude measured in the dual norm of the block regularizer.

In the more general setting of weak block sparsity, Corollary 4 yields a number of novel results. For instance, for a given set of groups $\mathcal{G}$, we can consider the block sparse analog of the $\ell_q$-"ball", namely the set

$$\mathbb{B}_q(R_q; \mathcal{G}, 2) := \Big\{ \theta \in \mathbb{R}^p \;\Big|\; \sum_{t=1}^{N_{\mathcal{G}}} \|\theta_{G_t}\|_2^q \le R_q \Big\}.$$

In this case, if we optimize the choice of $S_{\mathcal{G}}$ in the bound (49) so as to trade off the estimation and approximation errors, then we obtain

$$\|\hat{\theta} - \theta^*\|_2^2 \lesssim R_q \Big( \frac{m}{n} + \frac{\log N_{\mathcal{G}}}{n} \Big)^{1 - q/2},$$

which is a novel result. This result is a generalization of our earlier Corollary 3, to which it reduces when we have $N_{\mathcal{G}} = p$ groups each of size $m = 1$.



Case $\alpha = +\infty$: Now consider the case of $\ell_1/\ell_\infty$ regularization, as suggested in past work [71]. In this case, Corollary 4 implies that $\|\hat{\theta} - \theta^*\|_2^2 \lesssim \frac{s_{\mathcal{G}}\, m^2}{n} + \frac{s_{\mathcal{G}} \log N_{\mathcal{G}}}{n}$. Similar to the case $\alpha = 2$, this bound consists of an estimation term and a search term. The estimation term $\frac{s_{\mathcal{G}}\, m^2}{n}$ is larger by a factor of $m$, which corresponds to the amount by which an $\ell_\infty$-ball in $m$ dimensions is larger than the corresponding $\ell_2$-ball.

We provide the proof of Corollary 4 in Appendix C.2. It is based on verifying the conditions of Theorem 1: more precisely, we use Proposition 1 in order to establish RSC, and we provide a lemma that shows that the regularization choice (48) is valid in the context of Theorem 1.
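The comparison between the cases $\alpha = 2$ and $\alpha = +\infty$ can be sketched numerically. The snippet below splits $s_{\mathcal{G}} \lambda_n^2$ into its estimation and search parts, dropping constants, $\sigma$ and the RSC constant; the helper `rate_terms` and the problem sizes are assumptions made for illustration only.

```python
import math

def rate_terms(s_groups, m, alpha, num_groups, n):
    """Estimation and search terms of the exact group-sparse rate, up to
    constants: s_G * m^(2(1 - 1/alpha)) / n and s_G * log(N_G) / n."""
    inv_alpha = 0.0 if math.isinf(alpha) else 1.0 / alpha
    estimation = s_groups * m ** (2.0 * (1.0 - inv_alpha)) / n
    search = s_groups * math.log(num_groups) / n
    return estimation, search

# Hypothetical sizes: 5 active groups out of 200, each of size 8, n = 1000.
est_2, search_2 = rate_terms(s_groups=5, m=8, alpha=2, num_groups=200, n=1000)
est_inf, search_inf = rate_terms(s_groups=5, m=8, alpha=math.inf, num_groups=200, n=1000)
ratio = est_inf / est_2  # the estimation term grows by a factor of m
```

The search term is unaffected by the choice of $\alpha$; only the estimation term changes, by exactly the factor $m$ discussed above.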



6 Discussion 

In this paper, we have presented a unified framework for deriving error bounds and convergence 
rates for a class of regularized M-estimators. The theory is high-dimensional and non-asymptotic 
in nature, meaning that it yields explicit bounds that hold with high probability for finite sample 
sizes, and reveals the dependence on dimension and other structural parameters of the model. Two 
properties of the M-estimator play a central role in our framework. We isolated the notion of a 
regularizer being decomposable with respect to a pair of subspaces, and showed how it constrains 
the error vector — meaning the difference between any solution and the nominal parameter — to lie 
within a very specific set. This fact is significant, because it allows for a fruitful notion of restricted 
strong convexity to be developed for the loss function. Since the usual form of strong convexity 
cannot hold under high-dimensional scaling, this interaction between the decomposable regularizer 
and the loss function is essential. 

Our main result (Theorem 1) provides a deterministic bound on the error for a broad class of regularized M-estimators. By specializing this result to different statistical models, we derived various explicit convergence rates for different estimators, including some known results and a range of novel results. We derived convergence rates for sparse linear models, both under exact and approximate sparsity assumptions, and these results have been shown to be minimax optimal [57].
In the case of sparse group regularization, we established a novel upper bound of the oracle type, 
with a separation between the approximation and estimation error terms. For matrix estimation, 
the framework described here has been used to derive bounds on Frobenius error that are known to 
be minimax-optimal, both for multitask regression and autoregressive estimation [H2|, as well as the 
matrix completion problem 51]. In recent work 0], this framework has also been applied to obtain 
minimax-optimal rates for noisy matrix decomposition, which involves using a combination of the 
nuclear norm and elementwise $\ell_1$-norm. Finally, in a result that we report elsewhere, we have also
applied these results to deriving convergence rates on generalized linear models. Doing so requires
leveraging that restricted strong convexity can also be shown to hold for these models, as stated in
the bound (43).

There are a variety of interesting open questions associated with our work. In this paper, for simplicity of exposition, we have specified the regularization parameter in terms of the dual norm $\mathcal{R}^*$ of the regularizer. In many cases, this choice leads to optimal convergence rates, including linear regression over $\ell_q$-balls (Corollary 3) for sufficiently small radii, and various instances of low-rank matrix regression. In other cases, some refinements of our convergence rates are possible; for instance, for the special case of linear sparsity regression (i.e., an exactly sparse vector, with a constant fraction of non-zero elements), our rates can be sharpened by a more careful analysis of the noise term, which allows for a slightly smaller choice of the regularization parameter. Similarly, there are other non-parametric settings in which a more delicate choice of the regularization parameter is required H, 5^]. Last, we suspect that there are many other statistical models, not discussed in this paper, for which this framework can yield useful results. Some examples include different types of hierarchical regularizers and/or overlapping group regularizers [29, 30], as well as methods using combinations of decomposable regularizers, such as the fused Lasso [69].

A Proofs related to Theorem 1

In this section, we collect the proofs of Lemma 1 and our main result. All our arguments in this section are deterministic, and both proofs make use of the function $\mathcal{F} : \mathbb{R}^p \to \mathbb{R}$ given by

$$\mathcal{F}(\Delta) := \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) + \lambda_n \big\{ \mathcal{R}(\theta^* + \Delta) - \mathcal{R}(\theta^*) \big\}.$$

In addition, we exploit the following fact: since $\mathcal{F}(0) = 0$, the optimal error $\hat{\Delta} = \hat{\theta} - \theta^*$ must satisfy $\mathcal{F}(\hat{\Delta}) \le 0$.

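A.1 Proof of Lemma 1

Note that the function $\mathcal{F}$ consists of two parts: a difference of loss functions, and a difference of regularizers. In order to control $\mathcal{F}$, we require bounds on these two quantities:

Lemma 3 (Deviation inequalities). For any decomposable regularizer and $p$-dimensional vectors $\theta^*$ and $\Delta$, we have

$$\mathcal{R}(\theta^* + \Delta) - \mathcal{R}(\theta^*) \ge \mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp}) - \mathcal{R}(\Delta_{\overline{\mathcal{M}}}) - 2 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}). \tag{51}$$

Moreover, as long as $\lambda_n \ge 2 \mathcal{R}^*(\nabla \mathcal{L}(\theta^*))$ and $\mathcal{L}$ is convex, we have

$$\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) \ge -\frac{\lambda_n}{2} \big[ \mathcal{R}(\Delta_{\overline{\mathcal{M}}}) + \mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp}) \big]. \tag{52}$$

Proof. Since $\mathcal{R}(\theta^* + \Delta) = \mathcal{R}(\theta^*_{\mathcal{M}} + \theta^*_{\mathcal{M}^\perp} + \Delta_{\overline{\mathcal{M}}} + \Delta_{\overline{\mathcal{M}}^\perp})$, triangle inequality implies that

$$\mathcal{R}(\theta^* + \Delta) \ge \mathcal{R}(\theta^*_{\mathcal{M}} + \Delta_{\overline{\mathcal{M}}^\perp}) - \mathcal{R}(\theta^*_{\mathcal{M}^\perp} + \Delta_{\overline{\mathcal{M}}}) \ge \mathcal{R}(\theta^*_{\mathcal{M}} + \Delta_{\overline{\mathcal{M}}^\perp}) - \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) - \mathcal{R}(\Delta_{\overline{\mathcal{M}}}).$$

By decomposability applied to $\theta^*_{\mathcal{M}}$ and $\Delta_{\overline{\mathcal{M}}^\perp}$, we have $\mathcal{R}(\theta^*_{\mathcal{M}} + \Delta_{\overline{\mathcal{M}}^\perp}) = \mathcal{R}(\theta^*_{\mathcal{M}}) + \mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp})$, so that

$$\mathcal{R}(\theta^* + \Delta) \ge \mathcal{R}(\theta^*_{\mathcal{M}}) + \mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp}) - \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) - \mathcal{R}(\Delta_{\overline{\mathcal{M}}}). \tag{53}$$

Similarly, by triangle inequality, we have $\mathcal{R}(\theta^*) \le \mathcal{R}(\theta^*_{\mathcal{M}}) + \mathcal{R}(\theta^*_{\mathcal{M}^\perp})$. Combining this inequality with the bound (53), we obtain

$$\begin{aligned}
\mathcal{R}(\theta^* + \Delta) - \mathcal{R}(\theta^*) &\ge \mathcal{R}(\theta^*_{\mathcal{M}}) + \mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp}) - \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) - \mathcal{R}(\Delta_{\overline{\mathcal{M}}}) - \big\{ \mathcal{R}(\theta^*_{\mathcal{M}}) + \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) \big\} \\
&= \mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp}) - \mathcal{R}(\Delta_{\overline{\mathcal{M}}}) - 2 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}),
\end{aligned}$$

which yields the claim (51).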

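Turning to the loss difference, using the convexity of the loss function $\mathcal{L}$, we have

$$\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) \ge \langle \nabla \mathcal{L}(\theta^*), \Delta \rangle \ge -|\langle \nabla \mathcal{L}(\theta^*), \Delta \rangle|.$$

Applying the (generalized) Cauchy-Schwarz inequality with the regularizer and its dual, we obtain

$$|\langle \nabla \mathcal{L}(\theta^*), \Delta \rangle| \le \mathcal{R}^*(\nabla \mathcal{L}(\theta^*))\, \mathcal{R}(\Delta) \le \frac{\lambda_n}{2} \big[ \mathcal{R}(\Delta_{\overline{\mathcal{M}}}) + \mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp}) \big],$$

where the final inequality uses triangle inequality, and the assumed bound $\lambda_n \ge 2 \mathcal{R}^*(\nabla \mathcal{L}(\theta^*))$. Consequently, we conclude that $\mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) \ge -\frac{\lambda_n}{2} \big[ \mathcal{R}(\Delta_{\overline{\mathcal{M}}}) + \mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp}) \big]$, as claimed. □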
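As a sanity check on the deviation inequality (51), the following minimal sketch tests it numerically in the simplest decomposable case. It assumes the $\ell_1$ norm as regularizer and takes $\mathcal{M} = \overline{\mathcal{M}}$ to be the vectors supported on a fixed index set; the helper names and the random test vectors are hypothetical.

```python
import random

def l1(v):
    return sum(abs(x) for x in v)

def restrict(v, idx):
    """Keep the coordinates in idx and zero out the rest."""
    return [x if i in idx else 0.0 for i, x in enumerate(v)]

def deviation_gap(theta, delta, support):
    """Left side of (51) minus its right side, for the l1 norm with
    M = M-bar = vectors supported on `support`."""
    complement = set(range(len(theta))) - support
    lhs = l1([a + b for a, b in zip(theta, delta)]) - l1(theta)
    rhs = (l1(restrict(delta, complement))
           - l1(restrict(delta, support))
           - 2.0 * l1(restrict(theta, complement)))
    return lhs - rhs

rng = random.Random(0)
p, support = 10, {0, 1, 2}
gaps = []
for _ in range(1000):
    theta = [rng.gauss(0, 1) for _ in range(p)]
    delta = [rng.gauss(0, 1) for _ in range(p)]
    gaps.append(deviation_gap(theta, delta, support))
min_gap = min(gaps)  # (51) predicts that this is never negative
```

The inequality is deterministic, so the gap should be non-negative for every draw, not merely on average.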

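We can now complete the proof of Lemma 1. Combining the two lower bounds (51) and (52), we obtain

$$\begin{aligned}
0 \ge \mathcal{F}(\hat{\Delta}) &\ge \lambda_n \big\{ \mathcal{R}(\hat{\Delta}_{\overline{\mathcal{M}}^\perp}) - \mathcal{R}(\hat{\Delta}_{\overline{\mathcal{M}}}) - 2 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) \big\} - \frac{\lambda_n}{2} \big[ \mathcal{R}(\hat{\Delta}_{\overline{\mathcal{M}}}) + \mathcal{R}(\hat{\Delta}_{\overline{\mathcal{M}}^\perp}) \big] \\
&= \frac{\lambda_n}{2} \big\{ \mathcal{R}(\hat{\Delta}_{\overline{\mathcal{M}}^\perp}) - 3 \mathcal{R}(\hat{\Delta}_{\overline{\mathcal{M}}}) - 4 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) \big\},
\end{aligned}$$

from which the claim follows.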

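A.2 Proof of Theorem 1

Recall the set $\mathbb{C}(\mathcal{M}, \overline{\mathcal{M}}^\perp; \theta^*)$ from equation (17). Since the subspace pair $(\mathcal{M}, \overline{\mathcal{M}}^\perp)$ and true parameter $\theta^*$ remain fixed throughout this proof, we adopt the shorthand notation $\mathbb{C}$. Letting $\delta > 0$ be a given error radius, the following lemma shows that it suffices to control the sign of the function $\mathcal{F}$ over the set $\mathbb{K}(\delta) := \mathbb{C} \cap \{ \|\Delta\| = \delta \}$.

Lemma 4. If $\mathcal{F}(\Delta) > 0$ for all vectors $\Delta \in \mathbb{K}(\delta)$, then $\|\hat{\Delta}\| \le \delta$.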

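Proof. We first claim that $\mathbb{C}$ is star-shaped, meaning that if $\hat{\Delta} \in \mathbb{C}$, then the entire line segment $\{ t \hat{\Delta} \mid t \in (0,1) \}$ connecting $\hat{\Delta}$ with the all-zeroes vector is contained within $\mathbb{C}$. This property is immediate whenever $\theta^* \in \mathcal{M}$, since $\mathbb{C}$ is then a cone, as illustrated in Figure 1(a). Now consider the general case, when $\theta^* \notin \mathcal{M}$. We first observe that for any $t \in (0,1)$,

$$\Pi_{\overline{\mathcal{M}}}(t \Delta) = \arg\min_{\gamma \in \overline{\mathcal{M}}} \|t \Delta - \gamma\| = t \arg\min_{\gamma \in \overline{\mathcal{M}}} \Big\| \Delta - \frac{\gamma}{t} \Big\| = t\, \Pi_{\overline{\mathcal{M}}}(\Delta),$$

using the fact that $\gamma / t$ also belongs to the subspace $\overline{\mathcal{M}}$. The equality $\Pi_{\overline{\mathcal{M}}^\perp}(t \Delta) = t\, \Pi_{\overline{\mathcal{M}}^\perp}(\Delta)$ follows similarly. Consequently, for all $\Delta \in \mathbb{C}$, we have

$$\mathcal{R}(\Pi_{\overline{\mathcal{M}}^\perp}(t \Delta)) = \mathcal{R}(t\, \Pi_{\overline{\mathcal{M}}^\perp}(\Delta)) \stackrel{(i)}{=} t\, \mathcal{R}(\Pi_{\overline{\mathcal{M}}^\perp}(\Delta)) \stackrel{(ii)}{\le} t \big\{ 3 \mathcal{R}(\Pi_{\overline{\mathcal{M}}}(\Delta)) + 4 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) \big\},$$

where step (i) uses the fact that any norm is positive homogeneous,^5 and step (ii) uses the inclusion $\Delta \in \mathbb{C}$. We now observe that $3 t\, \mathcal{R}(\Pi_{\overline{\mathcal{M}}}(\Delta)) = 3 \mathcal{R}(\Pi_{\overline{\mathcal{M}}}(t \Delta))$, and moreover, since $t \in (0,1)$, we have $4 t\, \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) \le 4 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*))$. Putting together the pieces, we find that

$$\mathcal{R}(\Pi_{\overline{\mathcal{M}}^\perp}(t \Delta)) \le 3 \mathcal{R}(\Pi_{\overline{\mathcal{M}}}(t \Delta)) + 4 t\, \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)) \le 3 \mathcal{R}(\Pi_{\overline{\mathcal{M}}}(t \Delta)) + 4 \mathcal{R}(\Pi_{\mathcal{M}^\perp}(\theta^*)),$$

showing that $t \Delta \in \mathbb{C}$ for all $t \in (0,1)$, and hence that $\mathbb{C}$ is star-shaped.

^5 Explicitly, for any norm and non-negative scalar $t$, we have $\|t x\| = t \|x\|$.

Turning to the lemma itself, we prove the contrapositive statement: in particular, we show that if for some optimal solution $\hat{\theta}$, the associated error vector $\hat{\Delta} = \hat{\theta} - \theta^*$ satisfies the inequality $\|\hat{\Delta}\| > \delta$, then there must be some vector $\tilde{\Delta} \in \mathbb{K}(\delta)$ such that $\mathcal{F}(\tilde{\Delta}) \le 0$. If $\|\hat{\Delta}\| > \delta$, then the line joining $\hat{\Delta}$ to $0$ must intersect the set $\mathbb{K}(\delta)$ at some intermediate point $t^* \hat{\Delta}$, for some $t^* \in (0,1)$. Since the loss function $\mathcal{L}$ and regularizer $\mathcal{R}$ are convex, the function $\mathcal{F}$ is also convex for any choice of the regularization parameter, so that by Jensen's inequality,

$$\mathcal{F}(t^* \hat{\Delta}) = \mathcal{F}\big( t^* \hat{\Delta} + (1 - t^*)\, 0 \big) \le t^* \mathcal{F}(\hat{\Delta}) + (1 - t^*) \mathcal{F}(0) \stackrel{(i)}{=} t^* \mathcal{F}(\hat{\Delta}),$$

where equality (i) uses the fact that $\mathcal{F}(0) = 0$ by construction. But since $\hat{\Delta}$ is optimal, we must have $\mathcal{F}(\hat{\Delta}) \le 0$, and hence $\mathcal{F}(t^* \hat{\Delta}) \le 0$ as well. Thus, we have constructed a vector $\tilde{\Delta} = t^* \hat{\Delta}$ with the claimed properties, thereby establishing Lemma 4. □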
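The star-shaped property used in the proof of Lemma 4 can be illustrated numerically. The sketch below assumes the $\ell_1$ norm with $\mathcal{M} = \overline{\mathcal{M}}$ given by vectors supported on an index set $S$, so that $\mathbb{C}$ is the set of $\Delta$ with $\|\Delta_{S^c}\|_1 \le 3 \|\Delta_S\|_1 + 4 \|\theta^*_{S^c}\|_1$; the particular $\theta^*$ and sample sizes are hypothetical.

```python
import random

def l1(v):
    return sum(abs(x) for x in v)

def in_cone_set(delta, theta_star, support):
    """Membership in C for the l1 norm with M = M-bar = vectors supported on
    `support`: ||delta_{S^c}||_1 <= 3 ||delta_S||_1 + 4 ||theta*_{S^c}||_1."""
    on = l1([x for i, x in enumerate(delta) if i in support])
    off = l1([x for i, x in enumerate(delta) if i not in support])
    tail = l1([x for i, x in enumerate(theta_star) if i not in support])
    return off <= 3.0 * on + 4.0 * tail + 1e-12  # small floating-point slack

rng = random.Random(1)
p, support = 10, {0, 1, 2}
# theta* has mass outside the support, so C is star-shaped but not a cone.
theta_star = [1.0, -2.0, 0.5] + [0.1] * (p - 3)
checked = violations = 0
for _ in range(2000):
    delta = [rng.gauss(0, 1) for _ in range(p)]
    if not in_cone_set(delta, theta_star, support):
        continue
    t = rng.uniform(0.0, 1.0)
    checked += 1
    if not in_cone_set([t * x for x in delta], theta_star, support):
        violations += 1
```

Every sampled point of $\mathbb{C}$ should remain in $\mathbb{C}$ after shrinking toward the origin, so `violations` is expected to stay at zero.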

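On the basis of Lemma 4, the proof of Theorem 1 will be complete if we can establish a lower bound on $\mathcal{F}(\Delta)$ over $\mathbb{K}(\delta)$ for an appropriately chosen radius $\delta > 0$. For an arbitrary $\Delta \in \mathbb{K}(\delta)$, we have

$$\begin{aligned}
\mathcal{F}(\Delta) &= \mathcal{L}(\theta^* + \Delta) - \mathcal{L}(\theta^*) + \lambda_n \big\{ \mathcal{R}(\theta^* + \Delta) - \mathcal{R}(\theta^*) \big\} \\
&\stackrel{(i)}{\ge} \langle \nabla \mathcal{L}(\theta^*), \Delta \rangle + \kappa_{\mathcal{L}} \|\Delta\|^2 - \tau_{\mathcal{L}}^2(\theta^*) + \lambda_n \big\{ \mathcal{R}(\theta^* + \Delta) - \mathcal{R}(\theta^*) \big\} \\
&\stackrel{(ii)}{\ge} \langle \nabla \mathcal{L}(\theta^*), \Delta \rangle + \kappa_{\mathcal{L}} \|\Delta\|^2 - \tau_{\mathcal{L}}^2(\theta^*) + \lambda_n \big\{ \mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp}) - \mathcal{R}(\Delta_{\overline{\mathcal{M}}}) - 2 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) \big\},
\end{aligned}$$

where inequality (i) follows from the RSC condition, and inequality (ii) follows from the bound (51).

By the Cauchy-Schwarz inequality applied to the regularizer $\mathcal{R}$ and its dual $\mathcal{R}^*$, we have $|\langle \nabla \mathcal{L}(\theta^*), \Delta \rangle| \le \mathcal{R}^*(\nabla \mathcal{L}(\theta^*))\, \mathcal{R}(\Delta)$. Since $\lambda_n \ge 2 \mathcal{R}^*(\nabla \mathcal{L}(\theta^*))$ by assumption, we conclude that $|\langle \nabla \mathcal{L}(\theta^*), \Delta \rangle| \le \frac{\lambda_n}{2} \mathcal{R}(\Delta)$, and hence that

$$\mathcal{F}(\Delta) \ge \kappa_{\mathcal{L}} \|\Delta\|^2 - \tau_{\mathcal{L}}^2(\theta^*) + \lambda_n \big\{ \mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp}) - \mathcal{R}(\Delta_{\overline{\mathcal{M}}}) - 2 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) \big\} - \frac{\lambda_n}{2} \mathcal{R}(\Delta).$$

By triangle inequality, we have $\mathcal{R}(\Delta) = \mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp} + \Delta_{\overline{\mathcal{M}}}) \le \mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp}) + \mathcal{R}(\Delta_{\overline{\mathcal{M}}})$, and hence, following some algebra,

$$\begin{aligned}
\mathcal{F}(\Delta) &\ge \kappa_{\mathcal{L}} \|\Delta\|^2 - \tau_{\mathcal{L}}^2(\theta^*) + \lambda_n \Big\{ \frac{1}{2} \mathcal{R}(\Delta_{\overline{\mathcal{M}}^\perp}) - \frac{3}{2} \mathcal{R}(\Delta_{\overline{\mathcal{M}}}) - 2 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) \Big\} \\
&\ge \kappa_{\mathcal{L}} \|\Delta\|^2 - \tau_{\mathcal{L}}^2(\theta^*) - \frac{\lambda_n}{2} \big\{ 3 \mathcal{R}(\Delta_{\overline{\mathcal{M}}}) + 4 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) \big\}. \qquad (54)
\end{aligned}$$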

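Now by definition (21) of the subspace compatibility, we have the inequality $\mathcal{R}(\Delta_{\overline{\mathcal{M}}}) \le \Psi(\overline{\mathcal{M}}) \|\Delta_{\overline{\mathcal{M}}}\|$. Since the projection $\Delta_{\overline{\mathcal{M}}} = \Pi_{\overline{\mathcal{M}}}(\Delta)$ is defined in terms of the norm $\|\cdot\|$, it is non-expansive. Since $0 \in \overline{\mathcal{M}}$, we have

$$\|\Delta_{\overline{\mathcal{M}}}\| = \|\Pi_{\overline{\mathcal{M}}}(\Delta) - \Pi_{\overline{\mathcal{M}}}(0)\| \stackrel{(i)}{\le} \|\Delta - 0\| = \|\Delta\|,$$

where inequality (i) uses non-expansivity of the projection. Combining with the earlier bound, we conclude that $\mathcal{R}(\Delta_{\overline{\mathcal{M}}}) \le \Psi(\overline{\mathcal{M}}) \|\Delta\|$. Substituting into the lower bound (54), we obtain

$$\mathcal{F}(\Delta) \ge \kappa_{\mathcal{L}} \|\Delta\|^2 - \tau_{\mathcal{L}}^2(\theta^*) - \frac{\lambda_n}{2} \big\{ 3 \Psi(\overline{\mathcal{M}}) \|\Delta\| + 4 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) \big\}.$$

The right-hand side of this inequality is a strictly positive definite quadratic form in $\|\Delta\|$, and so will be positive for $\|\Delta\|$ sufficiently large. In particular, some algebra shows that this is the case as long as

$$\|\Delta\|^2 \ge \delta^2 := 9 \frac{\lambda_n^2}{\kappa_{\mathcal{L}}^2} \Psi^2(\overline{\mathcal{M}}) + \frac{\lambda_n}{\kappa_{\mathcal{L}}} \big\{ 2 \tau_{\mathcal{L}}^2(\theta^*) + 4 \mathcal{R}(\theta^*_{\mathcal{M}^\perp}) \big\},$$

thereby completing the proof of Theorem 1.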

B Proof of Lemma 2

For any $\Delta$ in the set $\mathbb{C}(S_\eta)$, we have

$$\|\Delta\|_1 \le 4 \|\Delta_{S_\eta}\|_1 + 4 \|\theta^*_{S_\eta^c}\|_1 \le 4 \sqrt{|S_\eta|}\, \|\Delta\|_2 + 4 R_q\, \eta^{1-q} \le 4 \sqrt{R_q}\, \eta^{-q/2} \|\Delta\|_2 + 4 R_q\, \eta^{1-q},$$

where we have used the bounds (39) and (40). Therefore, for any vector $\Delta \in \mathbb{C}(S_\eta)$, the condition (31) implies that



in V n 



lXAh > K 1 ||A|| 2 - K2 ,/^{7^r ? -/ 2 ||A|| 2 + ^ ?? 1 -n 



RqlQgP _ / 2 \ logP D 1-g 

|A|| 2 ^ Ki — K2\l — - — v q/ } - K 2]j— — R q n q . 



By our choices n = % and X n = 4a J^, we have « 2 J^ip r^/ 2 = ^(l^) 1 "^ 



Kl — --a y n i "^V n (8ct) 

1 1 V A II 

which is less than k\/2 under the stated assumptions. Thus, we obtain the lower bound 11 ^J 12 > 



hi 



M A|| 2 — 2k 2 J 1 ^^- R q r] 1 q , as claimed. 



C Proofs for group-sparse norms

In this section, we collect the proofs of results related to the group-sparse norms in Section 5.

C.1 Proof of Proposition 1

The proof of this result follows similar lines to the proof of condition (31) given by Raskutti et al. [5g], hereafter RWY, who established this result in the special case of the $\ell_1$-norm. Here we describe only those portions of the proof that require modification. For a radius $t > 0$, define the set

$$V(t) := \big\{ \theta \in \mathbb{R}^p \mid \|\Sigma^{1/2} \theta\|_2 = 1, \ \|\theta\|_{\mathcal{G},\alpha} \le t \big\},$$

as well as the random variable $M(t; X) := 1 - \inf_{\theta \in V(t)} \frac{\|X \theta\|_2}{\sqrt{n}}$. The argument in Section 4.2 of RWY makes use of the Gordon-Slepian comparison inequality in order to upper bound this quantity.
Following the same steps, we obtain the modified upper bound 

$$\mathbb{E}[M(t; X)] \le \frac{1}{4} + \frac{1}{\sqrt{n}}\, \mathbb{E}\Big[ \max_{j = 1, \ldots, N_{\mathcal{G}}} \|w_{G_j}\|_{\alpha^*} \Big]\, t,$$

where $w \sim N(0, \Sigma)$. The argument in Section 4.3 uses concentration of measure to show that this same bound will hold with high probability for $M(t; X)$ itself; the same reasoning applies here. Finally, the argument in Section 4.4 of RWY uses a peeling argument to make the bound suitably uniform over choices of the radius $t$. This argument allows us to conclude that

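$$\frac{\|X \theta\|_2}{\sqrt{n}} \ge \frac{1}{4} \|\Sigma^{1/2} \theta\|_2 - 9\, \mathbb{E}\Big[ \max_{j = 1, \ldots, N_{\mathcal{G}}} \frac{\|w_{G_j}\|_{\alpha^*}}{\sqrt{n}} \Big] \|\theta\|_{\mathcal{G},\alpha} \quad \text{for all } \theta \in \mathbb{R}^p$$

with probability greater than $1 - c_1 \exp(-c_2 n)$. Recalling the definition of $\rho_{\mathcal{G}}(\alpha^*)$, we see that in the case $\Sigma = I_{p \times p}$, the claim holds with constants $(\kappa_1, \kappa_2) = (\tfrac{1}{4}, 9)$. Turning to the case of general $\Sigma$, let us define the matrix norm $|||A|||_{\alpha^*} := \max_{\|\beta\|_{\alpha^*} = 1} \|A \beta\|_{\alpha^*}$. With this notation, some algebra shows that the claim holds with $\kappa_1 = \tfrac{1}{4} \lambda_{\min}(\Sigma^{1/2})$ and $\kappa_2 = 9 \max_{t = 1, \ldots, N_{\mathcal{G}}} |||(\Sigma^{1/2})_{G_t}|||_{\alpha^*}$.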



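C.2 Proof of Corollary 4

In order to prove this claim, we need to verify that Theorem 1 may be applied. Doing so requires defining the appropriate model and perturbation subspaces, computing the compatibility constant, and checking that the specified choice (48) of regularization parameter $\lambda_n$ is valid. For a given subset $S_{\mathcal{G}} \subseteq \{1, 2, \ldots, N_{\mathcal{G}}\}$, define the subspaces

$$\mathcal{M}(S_{\mathcal{G}}) := \{ \theta \in \mathbb{R}^p \mid \theta_{G_t} = 0 \text{ for all } t \notin S_{\mathcal{G}} \}, \quad \text{and} \quad \mathcal{M}^\perp(S_{\mathcal{G}}) := \{ \theta \in \mathbb{R}^p \mid \theta_{G_t} = 0 \text{ for all } t \in S_{\mathcal{G}} \}.$$

As discussed in Example 2, the block norm $\|\cdot\|_{\mathcal{G},\alpha}$ is decomposable with respect to these subspaces. Let us compute the regularizer-error compatibility function, as defined in equation (21), that relates the regularizer ($\|\cdot\|_{\mathcal{G},\alpha}$ in this case) to the error norm (here the $\ell_2$-norm). For any $\Delta \in \mathcal{M}(S_{\mathcal{G}})$, we have

$$\|\Delta\|_{\mathcal{G},\alpha} = \sum_{t \in S_{\mathcal{G}}} \|\Delta_{G_t}\|_\alpha \stackrel{(a)}{\le} \sum_{t \in S_{\mathcal{G}}} \|\Delta_{G_t}\|_2 \le \sqrt{s_{\mathcal{G}}}\, \|\Delta\|_2,$$

where inequality (a) uses the fact that $\alpha \ge 2$.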
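The compatibility bound $\|\Delta\|_{\mathcal{G},\alpha} \le \sqrt{s_{\mathcal{G}}}\, \|\Delta\|_2$ for vectors supported on $s_{\mathcal{G}}$ groups can be checked with a short sketch. The group layout (six groups of size four, three of them active) and the helper names are assumptions made only for this illustration.

```python
import math
import random

def lp_norm(v, alpha):
    if math.isinf(alpha):
        return max(abs(x) for x in v)
    return sum(abs(x) ** alpha for x in v) ** (1.0 / alpha)

def block_norm(v, groups, alpha):
    """Block (1, alpha) norm: sum over groups of the l_alpha norm of each block."""
    return sum(lp_norm([v[i] for i in g], alpha) for g in groups)

rng = random.Random(2)
groups = [list(range(4 * t, 4 * t + 4)) for t in range(6)]  # 6 groups of size 4
active = groups[:3]                                        # S_G with s_G = 3
worst_ratio = 0.0
for alpha in (2, 3, math.inf):
    for _ in range(500):
        delta = [0.0] * 24
        for g in active:
            for i in g:
                delta[i] = rng.gauss(0, 1)
        ratio = block_norm(delta, groups, alpha) / lp_norm(delta, 2)
        worst_ratio = max(worst_ratio, ratio)
# The derivation above predicts worst_ratio <= sqrt(s_G) = sqrt(3).
```

The bound is attained only when $\alpha = 2$ and all active blocks have equal $\ell_2$ norm, so random draws typically fall strictly below $\sqrt{s_{\mathcal{G}}}$.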

Finally, let us check that the specified choice of $\lambda_n$ satisfies the condition (23). As in the proof of Corollary 2, we have $\nabla \mathcal{L}(\theta^*) = \frac{1}{n} X^T w$, so that the final step is to compute an upper bound on the quantity $\mathcal{R}^*(\frac{1}{n} X^T w) = \frac{1}{n} \max_{t = 1, \ldots, N_{\mathcal{G}}} \|(X^T w)_{G_t}\|_{\alpha^*}$ that holds with high probability.
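The dual norm appearing here is the maximum over groups of the $\ell_{\alpha^*}$ norm of each block, with $1/\alpha + 1/\alpha^* = 1$. The following minimal sketch computes it and checks the generalized Cauchy-Schwarz inequality $|\langle u, v \rangle| \le \|u\|_{\mathcal{G},\alpha}\, \max_t \|v_{G_t}\|_{\alpha^*}$ on random vectors; the helper names and group layout are hypothetical.

```python
import math
import random

def lp_norm(v, alpha):
    if math.isinf(alpha):
        return max(abs(x) for x in v)
    return sum(abs(x) ** alpha for x in v) ** (1.0 / alpha)

def conjugate_exponent(alpha):
    """alpha* with 1/alpha + 1/alpha* = 1, for alpha in [2, inf]."""
    return 1.0 if math.isinf(alpha) else alpha / (alpha - 1.0)

def block_norm(v, groups, alpha):
    return sum(lp_norm([v[i] for i in g], alpha) for g in groups)

def dual_block_norm(v, groups, alpha):
    """Dual of the block (1, alpha) norm: largest l_{alpha*} norm of any block."""
    a_star = conjugate_exponent(alpha)
    return max(lp_norm([v[i] for i in g], a_star) for g in groups)

rng = random.Random(3)
groups = [list(range(3 * t, 3 * t + 3)) for t in range(5)]  # 5 groups of size 3
max_violation = 0.0
for alpha in (2, 4, math.inf):
    for _ in range(500):
        u = [rng.gauss(0, 1) for _ in range(15)]
        v = [rng.gauss(0, 1) for _ in range(15)]
        inner = abs(sum(a * b for a, b in zip(u, v)))
        bound = block_norm(u, groups, alpha) * dual_block_norm(v, groups, alpha)
        max_violation = max(max_violation, inner - bound)
```

Since the inequality is deterministic, `max_violation` should remain at zero up to floating-point error.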

Lemma 5. Suppose that $X$ satisfies the block column normalization condition (47), and the observation noise is sub-Gaussian (33). Then we have

$$\mathbb{P}\Big[ \max_{t = 1, \ldots, N_{\mathcal{G}}} \Big\| \frac{X_{G_t}^T w}{n} \Big\|_{\alpha^*} \ge 2\sigma \Big\{ \frac{m^{1 - 1/\alpha}}{\sqrt{n}} + \sqrt{\frac{\log N_{\mathcal{G}}}{n}} \Big\} \Big] \le 2 \exp(-2 \log N_{\mathcal{G}}). \tag{55}$$

Proof. Throughout the proof, we assume without loss of generality that $\sigma = 1$, since the general result can be obtained by rescaling. For a fixed group $G$ of size $m$, consider the submatrix $X_G \in \mathbb{R}^{n \times m}$. We begin by establishing a tail bound for the random variable $\big\| \frac{X_G^T w}{n} \big\|_{\alpha^*}$.

Deviations above the mean: For any pair $w, w' \in \mathbb{R}^n$, we have

$$\Big| \Big\| \frac{X_G^T w}{n} \Big\|_{\alpha^*} - \Big\| \frac{X_G^T w'}{n} \Big\|_{\alpha^*} \Big| \le \frac{1}{n} \|X_G^T (w - w')\|_{\alpha^*} = \frac{1}{n} \max_{\|\theta\|_\alpha = 1} \langle X_G \theta, (w - w') \rangle.$$



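By definition of the $(\alpha \to 2)$ operator norm, we have

$$\frac{1}{n} \|X_G^T (w - w')\|_{\alpha^*} \le \frac{1}{n} |||X_G|||_{\alpha \to 2}\, \|w - w'\|_2 \stackrel{(i)}{\le} \frac{1}{\sqrt{n}} \|w - w'\|_2,$$

where inequality (i) uses the block normalization condition (47). We conclude that the function $w \mapsto \big\| \frac{X_G^T w}{n} \big\|_{\alpha^*}$ is Lipschitz with constant $1/\sqrt{n}$, so that by Gaussian concentration of measure for Lipschitz functions [38], we have

$$\mathbb{P}\Big[ \Big\| \frac{X_G^T w}{n} \Big\|_{\alpha^*} \ge \mathbb{E}\Big[ \Big\| \frac{X_G^T w}{n} \Big\|_{\alpha^*} \Big] + \delta \Big] \le 2 \exp\Big( -\frac{n \delta^2}{2} \Big) \quad \text{for all } \delta > 0. \tag{56}$$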



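Upper bounding the mean: For any vector $\beta \in \mathbb{R}^m$, define the zero-mean Gaussian random variable $Z_\beta = \frac{1}{n} \langle \beta, X_G^T w \rangle$, and note the relation $\big\| \frac{X_G^T w}{n} \big\|_{\alpha^*} = \max_{\|\beta\|_\alpha = 1} Z_\beta$. Thus, the quantity of interest is the supremum of a Gaussian process, and can be upper bounded using Gaussian comparison principles. For any two vectors $\|\beta\|_\alpha \le 1$ and $\|\beta'\|_\alpha \le 1$, we have

$$\mathbb{E}\big[ (Z_\beta - Z_{\beta'})^2 \big] = \frac{1}{n^2} \|X_G (\beta - \beta')\|_2^2 \stackrel{(a)}{\le} \frac{2}{n}\, \frac{|||X_G|||_{\alpha \to 2}^2}{n}\, \|\beta - \beta'\|_2^2 \stackrel{(b)}{\le} \frac{2}{n} \|\beta - \beta'\|_2^2,$$

where inequality (a) uses the fact that $\|\beta - \beta'\|_\alpha \le \sqrt{2}$, and inequality (b) uses the block normalization condition (47).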

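Now define a second Gaussian process $Y_\beta = \sqrt{\frac{2}{n}}\, \langle \beta, \varepsilon \rangle$, where $\varepsilon \sim N(0, I_{m \times m})$ is standard Gaussian. By construction, for any pair $\beta, \beta' \in \mathbb{R}^m$, we have $\mathbb{E}[(Y_\beta - Y_{\beta'})^2] = \frac{2}{n} \|\beta - \beta'\|_2^2 \ge \mathbb{E}[(Z_\beta - Z_{\beta'})^2]$, so that the Sudakov-Fernique comparison principle [3^] implies that

$$\mathbb{E}\Big[ \max_{\|\beta\|_\alpha = 1} Z_\beta \Big] \le \mathbb{E}\Big[ \max_{\|\beta\|_\alpha = 1} Y_\beta \Big].$$

By definition of $Y_\beta$, we have

$$\mathbb{E}\Big[ \max_{\|\beta\|_\alpha = 1} Y_\beta \Big] = \sqrt{\frac{2}{n}}\, \mathbb{E}\big[ \|\varepsilon\|_{\alpha^*} \big] = \sqrt{\frac{2}{n}}\, \mathbb{E}\Big[ \Big( \sum_{j=1}^m |\varepsilon_j|^{\alpha^*} \Big)^{1/\alpha^*} \Big] \le \sqrt{\frac{2}{n}}\, m^{1/\alpha^*} \big( \mathbb{E}[|\varepsilon_1|^{\alpha^*}] \big)^{1/\alpha^*},$$

using Jensen's inequality, and the concavity of the function $f(t) = t^{1/\alpha^*}$ for $\alpha^* \in [1, 2]$. Finally, we have $\big( \mathbb{E}[|\varepsilon_1|^{\alpha^*}] \big)^{1/\alpha^*} \le \sqrt{\mathbb{E}[\varepsilon_1^2]} = 1$ and $1/\alpha^* = 1 - 1/\alpha$, so that we have shown that $\mathbb{E}\big[ \max_{\|\beta\|_\alpha = 1} Y_\beta \big] \le \frac{2\, m^{1 - 1/\alpha}}{\sqrt{n}}$. Combining this bound with the concentration statement (56), we obtain

$$\mathbb{P}\Big[ \Big\| \frac{X_G^T w}{n} \Big\|_{\alpha^*} \ge \frac{2\, m^{1 - 1/\alpha}}{\sqrt{n}} + \delta \Big] \le 2 \exp\Big( -\frac{n \delta^2}{2} \Big).$$

We now apply the union bound over all groups, and set $\delta^2 = \frac{4 \log N_{\mathcal{G}}}{n}$ to conclude that

$$\mathbb{P}\Big[ \max_{t = 1, \ldots, N_{\mathcal{G}}} \Big\| \frac{X_{G_t}^T w}{n} \Big\|_{\alpha^*} \ge 2 \Big\{ \frac{m^{1 - 1/\alpha}}{\sqrt{n}} + \sqrt{\frac{\log N_{\mathcal{G}}}{n}} \Big\} \Big] \le 2 \exp(-2 \log N_{\mathcal{G}}),$$

as claimed. □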
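The Jensen step in the proof, which gives $\mathbb{E}\|\varepsilon\|_{\alpha^*} \le m^{1/\alpha^*} = m^{1 - 1/\alpha}$ for a standard Gaussian vector $\varepsilon$ in $m$ dimensions, can be illustrated with a small Monte Carlo sketch. The group size, number of trials and seed are hypothetical, and the simulated averages are estimates rather than exact expectations.

```python
import math
import random

def lp_norm(v, alpha):
    if math.isinf(alpha):
        return max(abs(x) for x in v)
    return sum(abs(x) ** alpha for x in v) ** (1.0 / alpha)

def mean_dual_norm(m, alpha, trials, rng):
    """Monte Carlo estimate of E||eps||_{alpha*} for eps ~ N(0, I_m)."""
    a_star = 1.0 if math.isinf(alpha) else alpha / (alpha - 1.0)
    total = 0.0
    for _ in range(trials):
        eps = [rng.gauss(0, 1) for _ in range(m)]
        total += lp_norm(eps, a_star)
    return total / trials

rng = random.Random(4)
m = 8
results = {}
for alpha in (2, math.inf):
    inv_alpha = 0.0 if math.isinf(alpha) else 1.0 / alpha
    estimate = mean_dual_norm(m, alpha, trials=4000, rng=rng)
    bound = m ** (1.0 - inv_alpha)  # m^{1/alpha*} = m^{1 - 1/alpha}
    results[alpha] = (estimate, bound)
```

For $\alpha = 2$ the bound $\sqrt{m}$ is nearly tight, while for $\alpha = +\infty$ the estimate sits well below $m$, reflecting the slack in the moment bound $\mathbb{E}|\varepsilon_1| \le 1$.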
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